From 139dfad6cfa0ff816ea06d70132b164a44257c12 Mon Sep 17 00:00:00 2001 From: Rocky Liao Date: Wed, 25 Mar 2020 10:26:38 +0800 Subject: dt-bindings: net: bluetooth: Add device tree bindings for QCA chip QCA6390 This patch adds compatible string for the QCA chip QCA6390. Signed-off-by: Rocky Liao Signed-off-by: Marcel Holtmann --- Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt b/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt index beca6466d59a..badf597c0e58 100644 --- a/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt +++ b/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt @@ -13,6 +13,7 @@ Required properties: * "qcom,wcn3990-bt" * "qcom,wcn3991-bt" * "qcom,wcn3998-bt" + * "qcom,qca6390-bt" Optional properties for compatible string qcom,qca6174-bt: -- cgit v1.2.3 From 3e782985cb3ce00a32c372b37d8feefdae18ddf1 Mon Sep 17 00:00:00 2001 From: Andrew Lunn Date: Mon, 20 Apr 2020 00:04:01 +0200 Subject: net: ethernet: fec: Allow configuration of MDIO bus speed MDIO busses typically operate at 2.5MHz. However many devices can operate at faster speeds. This then allows more MDIO transactions per second, useful for Ethernet switch statistics, or Ethernet PHY TDR data. Allow the bus speed to be configured, using the standard "clock-frequency" property, which i2c busses use to indicate the bus speed. Before using this property, ensure all devices on the bus do actually support the requested clock speed. Suggested-by: Chris Healy Signed-off-by: Andrew Lunn Reviewed-by: Florian Fainelli Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/fsl-fec.txt | 1 + Documentation/devicetree/bindings/net/mdio.yaml | 6 ++++++ drivers/net/ethernet/freescale/fec_main.c | 11 ++++++++--- 3 files changed, 15 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/fsl-fec.txt b/Documentation/devicetree/bindings/net/fsl-fec.txt index ff8b0f211aa1..26c492a2e0e1 100644 --- a/Documentation/devicetree/bindings/net/fsl-fec.txt +++ b/Documentation/devicetree/bindings/net/fsl-fec.txt @@ -82,6 +82,7 @@ ethernet@83fec000 { phy-supply = <®_fec_supply>; phy-handle = <ðphy>; mdio { + clock-frequency = <5000000>; ethphy: ethernet-phy@6 { compatible = "ethernet-phy-ieee802.3-c22"; reg = <6>; diff --git a/Documentation/devicetree/bindings/net/mdio.yaml b/Documentation/devicetree/bindings/net/mdio.yaml index 50c3397a82bc..ab4a9df8b8e2 100644 --- a/Documentation/devicetree/bindings/net/mdio.yaml +++ b/Documentation/devicetree/bindings/net/mdio.yaml @@ -39,6 +39,12 @@ properties: and must therefore be appropriately determined based on all PHY requirements (maximum value of all per-PHY RESET pulse widths). + clock-frequency: + description: + Desired MDIO bus clock frequency in Hz. Values greater than IEEE 802.3 + defined 2.5MHz should only be used when all devices on the bus support + the given clock speed. + patternProperties: "^ethernet-phy@[0-9a-f]+$": type: object diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c index 2267bf75784e..832a24e2805c 100644 --- a/drivers/net/ethernet/freescale/fec_main.c +++ b/drivers/net/ethernet/freescale/fec_main.c @@ -2067,6 +2067,7 @@ static int fec_enet_mii_init(struct platform_device *pdev) struct device_node *node; int err = -ENXIO; u32 mii_speed, holdtime; + u32 bus_freq; /* * The i.MX28 dual fec interfaces are not equal. @@ -2094,15 +2095,20 @@ static int fec_enet_mii_init(struct platform_device *pdev) return -ENOENT; } + bus_freq = 2500000; /* 2.5MHz by default */ + node = of_get_child_by_name(pdev->dev.of_node, "mdio"); + if (node) + of_property_read_u32(node, "clock-frequency", &bus_freq); + /* - * Set MII speed to 2.5 MHz (= clk_get_rate() / 2 * phy_speed) + * Set MII speed (= clk_get_rate() / 2 * phy_speed) * * The formula for FEC MDC is 'ref_freq / (MII_SPEED x 2)' while * for ENET-MAC is 'ref_freq / ((MII_SPEED + 1) x 2)'. The i.MX28 * Reference Manual has an error on this, and gets fixed on i.MX6Q * document. */ - mii_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ipg), 5000000); + mii_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ipg), bus_freq * 2); if (fep->quirks & FEC_QUIRK_ENET_MAC) mii_speed--; if (mii_speed > 63) { @@ -2148,7 +2154,6 @@ static int fec_enet_mii_init(struct platform_device *pdev) fep->mii_bus->priv = fep; fep->mii_bus->parent = &pdev->dev; - node = of_get_child_by_name(pdev->dev.of_node, "mdio"); err = of_mdiobus_register(fep->mii_bus, node); of_node_put(node); if (err) -- cgit v1.2.3 From 3c01eb62d1bd85a5dd1d22d74339728666ae2c45 Mon Sep 17 00:00:00 2001 From: Andrew Lunn Date: Mon, 20 Apr 2020 00:04:02 +0200 Subject: net: ethernet: fec: Allow the MDIO preamble to be disabled An MDIO transaction normally starts with 32 1s as a preamble. However not all devices requires such a preamble. Add a device tree property which allows the preamble to be suppressed. This will half the size of the MDIO transaction, allowing faster transactions. But it should only be used when all devices on the bus support suppressed preamble. Suggested-by: Chris Healy Signed-off-by: Andrew Lunn Reviewed-by: Florian Fainelli Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/mdio.yaml | 6 ++++++ drivers/net/ethernet/freescale/fec_main.c | 9 ++++++++- 2 files changed, 14 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/mdio.yaml b/Documentation/devicetree/bindings/net/mdio.yaml index ab4a9df8b8e2..cd6c6ae6dabb 100644 --- a/Documentation/devicetree/bindings/net/mdio.yaml +++ b/Documentation/devicetree/bindings/net/mdio.yaml @@ -45,6 +45,12 @@ properties: defined 2.5MHz should only be used when all devices on the bus support the given clock speed. + suppress-preamble: + description: + The 32 bit preamble should be suppressed. In order for this to + work, all devices on the bus must support suppressed preamble. + type: boolean + patternProperties: "^ethernet-phy@[0-9a-f]+$": type: object diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c index 832a24e2805c..1ae075a246a3 100644 --- a/drivers/net/ethernet/freescale/fec_main.c +++ b/drivers/net/ethernet/freescale/fec_main.c @@ -2064,6 +2064,7 @@ static int fec_enet_mii_init(struct platform_device *pdev) static struct mii_bus *fec0_mii_bus; struct net_device *ndev = platform_get_drvdata(pdev); struct fec_enet_private *fep = netdev_priv(ndev); + bool suppress_preamble = false; struct device_node *node; int err = -ENXIO; u32 mii_speed, holdtime; @@ -2097,8 +2098,11 @@ static int fec_enet_mii_init(struct platform_device *pdev) bus_freq = 2500000; /* 2.5MHz by default */ node = of_get_child_by_name(pdev->dev.of_node, "mdio"); - if (node) + if (node) { of_property_read_u32(node, "clock-frequency", &bus_freq); + suppress_preamble = of_property_read_bool(node, + "suppress-preamble"); + } /* * Set MII speed (= clk_get_rate() / 2 * phy_speed) @@ -2135,6 +2139,9 @@ static int fec_enet_mii_init(struct platform_device *pdev) fep->phy_speed = mii_speed << 1 | holdtime << 8; + if (suppress_preamble) + fep->phy_speed |= BIT(7); + writel(fep->phy_speed, fep->hwp + FEC_MII_SPEED); /* Clear any pending transaction complete indication */ -- cgit v1.2.3 From db30a57779b18b7cef092c21887ed2d23ad2bd35 Mon Sep 17 00:00:00 2001 From: Andrew Lunn Date: Mon, 20 Apr 2020 00:11:51 +0200 Subject: net: Add testing sysfs attribute Similar to speed, duplex and dorment, report the testing status in sysfs. Signed-off-by: Andrew Lunn Reviewed-by: Florian Fainelli Signed-off-by: David S. Miller --- Documentation/ABI/testing/sysfs-class-net | 13 +++++++++++++ net/core/net-sysfs.c | 15 ++++++++++++++- 2 files changed, 27 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net index 664a8f6a634f..3b404577f380 100644 --- a/Documentation/ABI/testing/sysfs-class-net +++ b/Documentation/ABI/testing/sysfs-class-net @@ -124,6 +124,19 @@ Description: authentication is performed (e.g: 802.1x). 'link_mode' attribute will also reflect the dormant state. +What: /sys/class/net//testing +Date: April 2002 +KernelVersion: 5.8 +Contact: netdev@vger.kernel.org +Description: + Indicates whether the interface is under test. Possible + values are: + 0: interface is not being tested + 1: interface is being tested + + When an interface is under test, it cannot be expected + to pass packets as normal. + What: /sys/clas/net//duplex Date: October 2009 KernelVersion: 2.6.33 diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 4773ad6ec111..0d9e46de205e 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -243,6 +243,18 @@ static ssize_t duplex_show(struct device *dev, } static DEVICE_ATTR_RO(duplex); +static ssize_t testing_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct net_device *netdev = to_net_dev(dev); + + if (netif_running(netdev)) + return sprintf(buf, fmt_dec, !!netif_testing(netdev)); + + return -EINVAL; +} +static DEVICE_ATTR_RO(testing); + static ssize_t dormant_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -260,7 +272,7 @@ static const char *const operstates[] = { "notpresent", /* currently unused */ "down", "lowerlayerdown", - "testing", /* currently unused */ + "testing", "dormant", "up" }; @@ -524,6 +536,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = { &dev_attr_speed.attr, &dev_attr_duplex.attr, &dev_attr_dormant.attr, + &dev_attr_testing.attr, &dev_attr_operstate.attr, &dev_attr_carrier_changes.attr, &dev_attr_ifalias.attr, -- cgit v1.2.3 From f42ceca226cadf2c27709499af8643acd4281cd7 Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Mon, 20 Apr 2020 11:07:21 -0700 Subject: dt-bindings: net: Correct description of 'broken-turn-around' The turn around bytes (2) are placed between the control phase of the MDIO transaction and the data phase, correct the wording to be more exact. Signed-off-by: Florian Fainelli Reviewed-by: Andrew Lunn Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/ethernet-phy.yaml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/ethernet-phy.yaml b/Documentation/devicetree/bindings/net/ethernet-phy.yaml index 5aa141ccc113..9b1f1147ca36 100644 --- a/Documentation/devicetree/bindings/net/ethernet-phy.yaml +++ b/Documentation/devicetree/bindings/net/ethernet-phy.yaml @@ -81,7 +81,8 @@ properties: $ref: /schemas/types.yaml#definitions/flag description: If set, indicates the PHY device does not correctly release - the turn around line low at the end of a MDIO transaction. + the turn around line low at end of the control phase of the + MDIO transaction. enet-phy-lane-swap: $ref: /schemas/types.yaml#definitions/flag -- cgit v1.2.3 From b92d905f2c9c5e66868fae26ba9ed32352df0f5a Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Mon, 20 Apr 2020 11:07:22 -0700 Subject: dt-bindings: net: mdio: Document common properties Some of the properties pertaining to the broken turn around or resets were only documented in ethernet-phy.yaml while they are applicable across all MDIO devices and not Ethernet PHYs specifically which are a superset. Signed-off-by: Florian Fainelli Reviewed-by: Andrew Lunn Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/mdio.yaml | 28 +++++++++++++++++++++++++ 1 file changed, 28 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/mdio.yaml b/Documentation/devicetree/bindings/net/mdio.yaml index cd6c6ae6dabb..37b9579b5651 100644 --- a/Documentation/devicetree/bindings/net/mdio.yaml +++ b/Documentation/devicetree/bindings/net/mdio.yaml @@ -62,6 +62,34 @@ patternProperties: description: The ID number for the PHY. + broken-turn-around: + $ref: /schemas/types.yaml#definitions/flag + description: + If set, indicates the MDIO device does not correctly release + the turn around line low at end of the control phase of the + MDIO transaction. + + resets: + maxItems: 1 + + reset-names: + const: phy + + reset-gpios: + maxItems: 1 + description: + The GPIO phandle and specifier for the MDIO reset signal. + + reset-assert-us: + description: + Delay after the reset was asserted in microseconds. If this + property is missing the delay will be skipped. + + reset-deassert-us: + description: + Delay after the reset was deasserted in microseconds. If + this property is missing the delay will be skipped. + required: - reg -- cgit v1.2.3 From 630c3ff8c3d554054229b8c1c3d3a2b9465ffa64 Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Mon, 20 Apr 2020 11:07:23 -0700 Subject: dt-bindings: net: mdio: Make descriptions more general A number of descriptions assume a PHY device, but since this binding describes a MDIO bus which can have different kinds of MDIO devices attached to it, rephrase some descriptions to be more general in that regard. Reviewed-by: Andrew Lunn Signed-off-by: Florian Fainelli Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/mdio.yaml | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/mdio.yaml b/Documentation/devicetree/bindings/net/mdio.yaml index 37b9579b5651..d6a3bf8550eb 100644 --- a/Documentation/devicetree/bindings/net/mdio.yaml +++ b/Documentation/devicetree/bindings/net/mdio.yaml @@ -31,13 +31,13 @@ properties: maxItems: 1 description: The phandle and specifier for the GPIO that controls the RESET - lines of all PHYs on that MDIO bus. + lines of all devices on that MDIO bus. reset-delay-us: description: - RESET pulse width in microseconds. It applies to all PHY devices - and must therefore be appropriately determined based on all PHY - requirements (maximum value of all per-PHY RESET pulse widths). + RESET pulse width in microseconds. It applies to all MDIO devices + and must therefore be appropriately determined based on all devices + requirements (maximum value of all per-device RESET pulse widths). clock-frequency: description: @@ -60,7 +60,7 @@ patternProperties: minimum: 0 maximum: 31 description: - The ID number for the PHY. + The ID number for the device. broken-turn-around: $ref: /schemas/types.yaml#definitions/flag -- cgit v1.2.3 From 4406d36dfdf1fbd954400e16ffeb915c1907d58a Mon Sep 17 00:00:00 2001 From: Michael Walle Date: Mon, 20 Apr 2020 20:21:13 +0200 Subject: net: phy: bcm54140: add hwmon support The PHY supports monitoring its die temperature as well as two analog voltages. Add support for it. Signed-off-by: Michael Walle Acked-by: Guenter Roeck Reviewed-by: Andrew Lunn Signed-off-by: David S. Miller --- Documentation/hwmon/bcm54140.rst | 45 +++++ Documentation/hwmon/index.rst | 1 + drivers/net/phy/Kconfig | 1 + drivers/net/phy/bcm54140.c | 396 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 443 insertions(+) create mode 100644 Documentation/hwmon/bcm54140.rst (limited to 'Documentation') diff --git a/Documentation/hwmon/bcm54140.rst b/Documentation/hwmon/bcm54140.rst new file mode 100644 index 000000000000..bc6ea4b45966 --- /dev/null +++ b/Documentation/hwmon/bcm54140.rst @@ -0,0 +1,45 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +Broadcom BCM54140 Quad SGMII/QSGMII PHY +======================================= + +Supported chips: + + * Broadcom BCM54140 + + Datasheet: not public + +Author: Michael Walle + +Description +----------- + +The Broadcom BCM54140 is a Quad SGMII/QSGMII PHY which supports monitoring +its die temperature as well as two analog voltages. + +The AVDDL is a 1.0V analogue voltage, the AVDDH is a 3.3V analogue voltage. +Both voltages and the temperature are measured in a round-robin fashion. + +Sysfs entries +------------- + +The following attributes are supported. + +======================= ======================================================== +in0_label "AVDDL" +in0_input Measured AVDDL voltage. +in0_min Minimum AVDDL voltage. +in0_max Maximum AVDDL voltage. +in0_alarm AVDDL voltage alarm. + +in1_label "AVDDH" +in1_input Measured AVDDH voltage. +in1_min Minimum AVDDH voltage. +in1_max Maximum AVDDH voltage. +in1_alarm AVDDH voltage alarm. + +temp1_input Die temperature. +temp1_min Minimum die temperature. +temp1_max Maximum die temperature. +temp1_alarm Die temperature alarm. +======================= ======================================================== diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst index 8ef62fd39787..1f0affb3b6e0 100644 --- a/Documentation/hwmon/index.rst +++ b/Documentation/hwmon/index.rst @@ -42,6 +42,7 @@ Hardware Monitoring Kernel Drivers asb100 asc7621 aspeed-pwm-tacho + bcm54140 bel-pfe coretemp da9052 diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig index cb7936b577de..bacfee41b564 100644 --- a/drivers/net/phy/Kconfig +++ b/drivers/net/phy/Kconfig @@ -349,6 +349,7 @@ config BROADCOM_PHY config BCM54140_PHY tristate "Broadcom BCM54140 PHY" depends on PHYLIB + depends on HWMON || HWMON=n select BCM_NET_PHYLIB help Support the Broadcom BCM54140 Quad SGMII/QSGMII PHY. diff --git a/drivers/net/phy/bcm54140.c b/drivers/net/phy/bcm54140.c index 0eeb60de67f8..aa854477e06a 100644 --- a/drivers/net/phy/bcm54140.c +++ b/drivers/net/phy/bcm54140.c @@ -6,6 +6,7 @@ #include #include +#include #include #include @@ -50,6 +51,69 @@ #define BCM54140_RDB_TOP_IMR_PORT1 BIT(5) #define BCM54140_RDB_TOP_IMR_PORT2 BIT(6) #define BCM54140_RDB_TOP_IMR_PORT3 BIT(7) +#define BCM54140_RDB_MON_CTRL 0x831 /* monitor control */ +#define BCM54140_RDB_MON_CTRL_V_MODE BIT(3) /* voltage mode */ +#define BCM54140_RDB_MON_CTRL_SEL_MASK GENMASK(2, 1) +#define BCM54140_RDB_MON_CTRL_SEL_TEMP 0 /* meassure temperature */ +#define BCM54140_RDB_MON_CTRL_SEL_1V0 1 /* meassure AVDDL 1.0V */ +#define BCM54140_RDB_MON_CTRL_SEL_3V3 2 /* meassure AVDDH 3.3V */ +#define BCM54140_RDB_MON_CTRL_SEL_RR 3 /* meassure all round-robin */ +#define BCM54140_RDB_MON_CTRL_PWR_DOWN BIT(0) /* power-down monitor */ +#define BCM54140_RDB_MON_TEMP_VAL 0x832 /* temperature value */ +#define BCM54140_RDB_MON_TEMP_MAX 0x833 /* temperature high thresh */ +#define BCM54140_RDB_MON_TEMP_MIN 0x834 /* temperature low thresh */ +#define BCM54140_RDB_MON_TEMP_DATA_MASK GENMASK(9, 0) +#define BCM54140_RDB_MON_1V0_VAL 0x835 /* AVDDL 1.0V value */ +#define BCM54140_RDB_MON_1V0_MAX 0x836 /* AVDDL 1.0V high thresh */ +#define BCM54140_RDB_MON_1V0_MIN 0x837 /* AVDDL 1.0V low thresh */ +#define BCM54140_RDB_MON_1V0_DATA_MASK GENMASK(10, 0) +#define BCM54140_RDB_MON_3V3_VAL 0x838 /* AVDDH 3.3V value */ +#define BCM54140_RDB_MON_3V3_MAX 0x839 /* AVDDH 3.3V high thresh */ +#define BCM54140_RDB_MON_3V3_MIN 0x83a /* AVDDH 3.3V low thresh */ +#define BCM54140_RDB_MON_3V3_DATA_MASK GENMASK(11, 0) +#define BCM54140_RDB_MON_ISR 0x83b /* interrupt status */ +#define BCM54140_RDB_MON_ISR_3V3 BIT(2) /* AVDDH 3.3V alarm */ +#define BCM54140_RDB_MON_ISR_1V0 BIT(1) /* AVDDL 1.0V alarm */ +#define BCM54140_RDB_MON_ISR_TEMP BIT(0) /* temperature alarm */ + +/* According to the datasheet the formula is: + * T = 413.35 - (0.49055 * bits[9:0]) + */ +#define BCM54140_HWMON_TO_TEMP(v) (413350L - (v) * 491) +#define BCM54140_HWMON_FROM_TEMP(v) DIV_ROUND_CLOSEST_ULL(413350L - (v), 491) + +/* According to the datasheet the formula is: + * U = bits[11:0] / 1024 * 220 / 0.2 + * + * Normalized: + * U = bits[11:0] / 4096 * 2514 + */ +#define BCM54140_HWMON_TO_IN_1V0(v) ((v) * 2514 >> 11) +#define BCM54140_HWMON_FROM_IN_1V0(v) DIV_ROUND_CLOSEST_ULL(((v) << 11), 2514) + +/* According to the datasheet the formula is: + * U = bits[10:0] / 1024 * 880 / 0.7 + * + * Normalized: + * U = bits[10:0] / 2048 * 4400 + */ +#define BCM54140_HWMON_TO_IN_3V3(v) ((v) * 4400 >> 12) +#define BCM54140_HWMON_FROM_IN_3V3(v) DIV_ROUND_CLOSEST_ULL(((v) << 12), 4400) + +#define BCM54140_HWMON_TO_IN(ch, v) ((ch) ? BCM54140_HWMON_TO_IN_3V3(v) \ + : BCM54140_HWMON_TO_IN_1V0(v)) +#define BCM54140_HWMON_FROM_IN(ch, v) ((ch) ? BCM54140_HWMON_FROM_IN_3V3(v) \ + : BCM54140_HWMON_FROM_IN_1V0(v)) +#define BCM54140_HWMON_IN_MASK(ch) ((ch) ? BCM54140_RDB_MON_3V3_DATA_MASK \ + : BCM54140_RDB_MON_1V0_DATA_MASK) +#define BCM54140_HWMON_IN_VAL_REG(ch) ((ch) ? BCM54140_RDB_MON_3V3_VAL \ + : BCM54140_RDB_MON_1V0_VAL) +#define BCM54140_HWMON_IN_MIN_REG(ch) ((ch) ? BCM54140_RDB_MON_3V3_MIN \ + : BCM54140_RDB_MON_1V0_MIN) +#define BCM54140_HWMON_IN_MAX_REG(ch) ((ch) ? BCM54140_RDB_MON_3V3_MAX \ + : BCM54140_RDB_MON_1V0_MAX) +#define BCM54140_HWMON_IN_ALARM_BIT(ch) ((ch) ? BCM54140_RDB_MON_ISR_3V3 \ + : BCM54140_RDB_MON_ISR_1V0) #define BCM54140_DEFAULT_DOWNSHIFT 5 #define BCM54140_MAX_DOWNSHIFT 9 @@ -57,8 +121,328 @@ struct bcm54140_priv { int port; int base_addr; +#if IS_ENABLED(CONFIG_HWMON) + bool pkg_init; + /* protect the alarm bits */ + struct mutex alarm_lock; + u16 alarm; +#endif }; +#if IS_ENABLED(CONFIG_HWMON) +static umode_t bcm54140_hwmon_is_visible(const void *data, + enum hwmon_sensor_types type, + u32 attr, int channel) +{ + switch (type) { + case hwmon_in: + switch (attr) { + case hwmon_in_min: + case hwmon_in_max: + return 0644; + case hwmon_in_label: + case hwmon_in_input: + case hwmon_in_alarm: + return 0444; + default: + return 0; + } + case hwmon_temp: + switch (attr) { + case hwmon_temp_min: + case hwmon_temp_max: + return 0644; + case hwmon_temp_input: + case hwmon_temp_alarm: + return 0444; + default: + return 0; + } + default: + return 0; + } +} + +static int bcm54140_hwmon_read_alarm(struct device *dev, unsigned int bit, + long *val) +{ + struct phy_device *phydev = dev_get_drvdata(dev); + struct bcm54140_priv *priv = phydev->priv; + int tmp, ret = 0; + + mutex_lock(&priv->alarm_lock); + + /* latch any alarm bits */ + tmp = bcm_phy_read_rdb(phydev, BCM54140_RDB_MON_ISR); + if (tmp < 0) { + ret = tmp; + goto out; + } + priv->alarm |= tmp; + + *val = !!(priv->alarm & bit); + priv->alarm &= ~bit; + +out: + mutex_unlock(&priv->alarm_lock); + return ret; +} + +static int bcm54140_hwmon_read_temp(struct device *dev, u32 attr, long *val) +{ + struct phy_device *phydev = dev_get_drvdata(dev); + u16 reg, tmp; + + switch (attr) { + case hwmon_temp_input: + reg = BCM54140_RDB_MON_TEMP_VAL; + break; + case hwmon_temp_min: + reg = BCM54140_RDB_MON_TEMP_MIN; + break; + case hwmon_temp_max: + reg = BCM54140_RDB_MON_TEMP_MAX; + break; + case hwmon_temp_alarm: + return bcm54140_hwmon_read_alarm(dev, + BCM54140_RDB_MON_ISR_TEMP, + val); + default: + return -EOPNOTSUPP; + } + + tmp = bcm_phy_read_rdb(phydev, reg); + if (tmp < 0) + return tmp; + + *val = BCM54140_HWMON_TO_TEMP(tmp & BCM54140_RDB_MON_TEMP_DATA_MASK); + + return 0; +} + +static int bcm54140_hwmon_read_in(struct device *dev, u32 attr, + int channel, long *val) +{ + struct phy_device *phydev = dev_get_drvdata(dev); + u16 bit, reg, tmp; + + switch (attr) { + case hwmon_in_input: + reg = BCM54140_HWMON_IN_VAL_REG(channel); + break; + case hwmon_in_min: + reg = BCM54140_HWMON_IN_MIN_REG(channel); + break; + case hwmon_in_max: + reg = BCM54140_HWMON_IN_MAX_REG(channel); + break; + case hwmon_in_alarm: + bit = BCM54140_HWMON_IN_ALARM_BIT(channel); + return bcm54140_hwmon_read_alarm(dev, bit, val); + default: + return -EOPNOTSUPP; + } + + tmp = bcm_phy_read_rdb(phydev, reg); + if (tmp < 0) + return tmp; + + tmp &= BCM54140_HWMON_IN_MASK(channel); + *val = BCM54140_HWMON_TO_IN(channel, tmp); + + return 0; +} + +static int bcm54140_hwmon_read(struct device *dev, + enum hwmon_sensor_types type, u32 attr, + int channel, long *val) +{ + switch (type) { + case hwmon_temp: + return bcm54140_hwmon_read_temp(dev, attr, val); + case hwmon_in: + return bcm54140_hwmon_read_in(dev, attr, channel, val); + default: + return -EOPNOTSUPP; + } +} + +static const char *const bcm54140_hwmon_in_labels[] = { + "AVDDL", + "AVDDH", +}; + +static int bcm54140_hwmon_read_string(struct device *dev, + enum hwmon_sensor_types type, u32 attr, + int channel, const char **str) +{ + switch (type) { + case hwmon_in: + switch (attr) { + case hwmon_in_label: + *str = bcm54140_hwmon_in_labels[channel]; + return 0; + default: + return -EOPNOTSUPP; + } + default: + return -EOPNOTSUPP; + } +} + +static int bcm54140_hwmon_write_temp(struct device *dev, u32 attr, + int channel, long val) +{ + struct phy_device *phydev = dev_get_drvdata(dev); + u16 mask = BCM54140_RDB_MON_TEMP_DATA_MASK; + u16 reg; + + val = clamp_val(val, BCM54140_HWMON_TO_TEMP(mask), + BCM54140_HWMON_TO_TEMP(0)); + + switch (attr) { + case hwmon_temp_min: + reg = BCM54140_RDB_MON_TEMP_MIN; + break; + case hwmon_temp_max: + reg = BCM54140_RDB_MON_TEMP_MAX; + break; + default: + return -EOPNOTSUPP; + } + + return bcm_phy_modify_rdb(phydev, reg, mask, + BCM54140_HWMON_FROM_TEMP(val)); +} + +static int bcm54140_hwmon_write_in(struct device *dev, u32 attr, + int channel, long val) +{ + struct phy_device *phydev = dev_get_drvdata(dev); + u16 mask = BCM54140_HWMON_IN_MASK(channel); + u16 reg; + + val = clamp_val(val, 0, BCM54140_HWMON_TO_IN(channel, mask)); + + switch (attr) { + case hwmon_in_min: + reg = BCM54140_HWMON_IN_MIN_REG(channel); + break; + case hwmon_in_max: + reg = BCM54140_HWMON_IN_MAX_REG(channel); + break; + default: + return -EOPNOTSUPP; + } + + return bcm_phy_modify_rdb(phydev, reg, mask, + BCM54140_HWMON_FROM_IN(channel, val)); +} + +static int bcm54140_hwmon_write(struct device *dev, + enum hwmon_sensor_types type, u32 attr, + int channel, long val) +{ + switch (type) { + case hwmon_temp: + return bcm54140_hwmon_write_temp(dev, attr, channel, val); + case hwmon_in: + return bcm54140_hwmon_write_in(dev, attr, channel, val); + default: + return -EOPNOTSUPP; + } +} + +static const struct hwmon_channel_info *bcm54140_hwmon_info[] = { + HWMON_CHANNEL_INFO(temp, + HWMON_T_INPUT | HWMON_T_MIN | HWMON_T_MAX | + HWMON_T_ALARM), + HWMON_CHANNEL_INFO(in, + HWMON_I_INPUT | HWMON_I_MIN | HWMON_I_MAX | + HWMON_I_ALARM | HWMON_I_LABEL, + HWMON_I_INPUT | HWMON_I_MIN | HWMON_I_MAX | + HWMON_I_ALARM | HWMON_I_LABEL), + NULL +}; + +static const struct hwmon_ops bcm54140_hwmon_ops = { + .is_visible = bcm54140_hwmon_is_visible, + .read = bcm54140_hwmon_read, + .read_string = bcm54140_hwmon_read_string, + .write = bcm54140_hwmon_write, +}; + +static const struct hwmon_chip_info bcm54140_chip_info = { + .ops = &bcm54140_hwmon_ops, + .info = bcm54140_hwmon_info, +}; + +static int bcm54140_enable_monitoring(struct phy_device *phydev) +{ + u16 mask, set; + + /* 3.3V voltage mode */ + set = BCM54140_RDB_MON_CTRL_V_MODE; + + /* select round-robin */ + mask = BCM54140_RDB_MON_CTRL_SEL_MASK; + set |= FIELD_PREP(BCM54140_RDB_MON_CTRL_SEL_MASK, + BCM54140_RDB_MON_CTRL_SEL_RR); + + /* remove power-down bit */ + mask |= BCM54140_RDB_MON_CTRL_PWR_DOWN; + + return bcm_phy_modify_rdb(phydev, BCM54140_RDB_MON_CTRL, mask, set); +} + +/* Check if one PHY has already done the init of the parts common to all PHYs + * in the Quad PHY package. + */ +static bool bcm54140_is_pkg_init(struct phy_device *phydev) +{ + struct bcm54140_priv *priv = phydev->priv; + struct mii_bus *bus = phydev->mdio.bus; + int base_addr = priv->base_addr; + struct phy_device *phy; + int i; + + /* Quad PHY */ + for (i = 0; i < 4; i++) { + phy = mdiobus_get_phy(bus, base_addr + i); + if (!phy) + continue; + + if ((phy->phy_id & phydev->drv->phy_id_mask) != + (phydev->drv->phy_id & phydev->drv->phy_id_mask)) + continue; + + priv = phy->priv; + + if (priv && priv->pkg_init) + return true; + } + + return false; +} + +static int bcm54140_probe_once(struct phy_device *phydev) +{ + struct device *hwmon; + int ret; + + /* enable hardware monitoring */ + ret = bcm54140_enable_monitoring(phydev); + if (ret) + return ret; + + hwmon = devm_hwmon_device_register_with_info(&phydev->mdio.dev, + "BCM54140", phydev, + &bcm54140_chip_info, + NULL); + return PTR_ERR_OR_ZERO(hwmon); +} +#endif + static int bcm54140_base_read_rdb(struct phy_device *phydev, u16 rdb) { struct bcm54140_priv *priv = phydev->priv; @@ -222,6 +606,18 @@ static int bcm54140_probe(struct phy_device *phydev) if (ret) return ret; +#if IS_ENABLED(CONFIG_HWMON) + mutex_init(&priv->alarm_lock); + + if (!bcm54140_is_pkg_init(phydev)) { + ret = bcm54140_probe_once(phydev); + if (ret) + return ret; + } + + priv->pkg_init = true; +#endif + phydev_dbg(phydev, "probed (port %d, base PHY address %d)\n", priv->port, priv->base_addr); -- cgit v1.2.3 From d9cc193cf0bf471630e203c369e60ab3735dd621 Mon Sep 17 00:00:00 2001 From: Oleksij Rempel Date: Wed, 22 Apr 2020 11:24:53 +0200 Subject: dt-bindings: net: phy: Add support for NXP TJA11xx Document the NXP TJA11xx PHY bindings. Signed-off-by: Oleksij Rempel Reviewed-by: Andrew Lunn Signed-off-by: David S. Miller --- .../devicetree/bindings/net/nxp,tja11xx.yaml | 61 ++++++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 Documentation/devicetree/bindings/net/nxp,tja11xx.yaml (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/nxp,tja11xx.yaml b/Documentation/devicetree/bindings/net/nxp,tja11xx.yaml new file mode 100644 index 000000000000..42be0255512b --- /dev/null +++ b/Documentation/devicetree/bindings/net/nxp,tja11xx.yaml @@ -0,0 +1,61 @@ +# SPDX-License-Identifier: GPL-2.0+ +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/nxp,tja11xx.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: NXP TJA11xx PHY + +maintainers: + - Andrew Lunn + - Florian Fainelli + - Heiner Kallweit + +description: + Bindings for NXP TJA11xx automotive PHYs + +allOf: + - $ref: ethernet-phy.yaml# + +patternProperties: + "^ethernet-phy@[0-9a-f]+$": + type: object + description: | + Some packages have multiple PHYs. Secondary PHY should be defines as + subnode of the first (parent) PHY. + + properties: + reg: + minimum: 0 + maximum: 31 + description: + The ID number for the child PHY. Should be +1 of parent PHY. + + required: + - reg + +examples: + - | + mdio { + #address-cells = <1>; + #size-cells = <0>; + + tja1101_phy0: ethernet-phy@4 { + reg = <0x4>; + }; + }; + - | + mdio { + #address-cells = <1>; + #size-cells = <0>; + + tja1102_phy0: ethernet-phy@4 { + reg = <0x4>; + #address-cells = <1>; + #size-cells = <0>; + + tja1102_phy1: ethernet-phy@5 { + reg = <0x5>; + }; + }; + }; -- cgit v1.2.3 From 3608a199749873edc992c74adf077c3a848121ad Mon Sep 17 00:00:00 2001 From: Oleksij Rempel Date: Fri, 24 Apr 2020 07:21:16 +0200 Subject: dt-bindings: net: convert qca,ar71xx documentation to yaml Now that we have the DT validation in place, let's convert the device tree bindings for the Atheros AR71XX over to a YAML schemas. Signed-off-by: Oleksij Rempel Signed-off-by: David S. Miller --- .../devicetree/bindings/net/qca,ar71xx.txt | 45 ----- .../devicetree/bindings/net/qca,ar71xx.yaml | 216 +++++++++++++++++++++ 2 files changed, 216 insertions(+), 45 deletions(-) delete mode 100644 Documentation/devicetree/bindings/net/qca,ar71xx.txt create mode 100644 Documentation/devicetree/bindings/net/qca,ar71xx.yaml (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/qca,ar71xx.txt b/Documentation/devicetree/bindings/net/qca,ar71xx.txt deleted file mode 100644 index 2a33e71ba72b..000000000000 --- a/Documentation/devicetree/bindings/net/qca,ar71xx.txt +++ /dev/null @@ -1,45 +0,0 @@ -Required properties: -- compatible: Should be "qca,-eth". Currently support compatibles are: - qca,ar7100-eth - Atheros AR7100 - qca,ar7240-eth - Atheros AR7240 - qca,ar7241-eth - Atheros AR7241 - qca,ar7242-eth - Atheros AR7242 - qca,ar9130-eth - Atheros AR9130 - qca,ar9330-eth - Atheros AR9330 - qca,ar9340-eth - Atheros AR9340 - qca,qca9530-eth - Qualcomm Atheros QCA9530 - qca,qca9550-eth - Qualcomm Atheros QCA9550 - qca,qca9560-eth - Qualcomm Atheros QCA9560 - -- reg : Address and length of the register set for the device -- interrupts : Should contain eth interrupt -- phy-mode : See ethernet.txt file in the same directory -- clocks: the clock used by the core -- clock-names: the names of the clock listed in the clocks property. These are - "eth" and "mdio". -- resets: Should contain phandles to the reset signals -- reset-names: Should contain the names of reset signal listed in the resets - property. These are "mac" and "mdio" - -Optional properties: -- phy-handle : phandle to the PHY device connected to this device. -- fixed-link : Assume a fixed link. See fixed-link.txt in the same directory. - Use instead of phy-handle. - -Optional subnodes: -- mdio : specifies the mdio bus, used as a container for phy nodes - according to phy.txt in the same directory - -Example: - -ethernet@1a000000 { - compatible = "qca,ar9330-eth"; - reg = <0x1a000000 0x200>; - interrupts = <5>; - resets = <&rst 13>, <&rst 23>; - reset-names = "mac", "mdio"; - clocks = <&pll ATH79_CLK_AHB>, <&pll ATH79_CLK_MDIO>; - clock-names = "eth", "mdio"; - - phy-mode = "gmii"; -}; diff --git a/Documentation/devicetree/bindings/net/qca,ar71xx.yaml b/Documentation/devicetree/bindings/net/qca,ar71xx.yaml new file mode 100644 index 000000000000..f99a5aabe923 --- /dev/null +++ b/Documentation/devicetree/bindings/net/qca,ar71xx.yaml @@ -0,0 +1,216 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/qca,ar71xx.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: QCA AR71XX MAC + +allOf: + - $ref: ethernet-controller.yaml# + +maintainers: + - Oleksij Rempel + +properties: + compatible: + oneOf: + - items: + - enum: + - qca,ar7100-eth # Atheros AR7100 + - qca,ar7240-eth # Atheros AR7240 + - qca,ar7241-eth # Atheros AR7241 + - qca,ar7242-eth # Atheros AR7242 + - qca,ar9130-eth # Atheros AR9130 + - qca,ar9330-eth # Atheros AR9330 + - qca,ar9340-eth # Atheros AR9340 + - qca,qca9530-eth # Qualcomm Atheros QCA9530 + - qca,qca9550-eth # Qualcomm Atheros QCA9550 + - qca,qca9560-eth # Qualcomm Atheros QCA9560 + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + '#address-cells': + description: number of address cells for the MDIO bus + const: 1 + + '#size-cells': + description: number of size cells on the MDIO bus + const: 0 + + clocks: + items: + - description: MAC main clock + - description: MDIO clock + + clock-names: + items: + - const: eth + - const: mdio + + resets: + items: + - description: MAC reset + - description: MDIO reset + + reset-names: + items: + - const: mac + - const: mdio + +required: + - compatible + - reg + - interrupts + - phy-mode + - clocks + - clock-names + - resets + - reset-names + +examples: + # Lager board + - | + eth0: ethernet@19000000 { + compatible = "qca,ar9330-eth"; + reg = <0x19000000 0x200>; + interrupts = <4>; + resets = <&rst 9>, <&rst 22>; + reset-names = "mac", "mdio"; + clocks = <&pll 1>, <&pll 2>; + clock-names = "eth", "mdio"; + qca,ethcfg = <ðcfg>; + phy-mode = "mii"; + phy-handle = <&phy_port4>; + }; + + eth1: ethernet@1a000000 { + compatible = "qca,ar9330-eth"; + reg = <0x1a000000 0x200>; + interrupts = <5>; + resets = <&rst 13>, <&rst 23>; + reset-names = "mac", "mdio"; + clocks = <&pll 1>, <&pll 2>; + clock-names = "eth", "mdio"; + + phy-mode = "gmii"; + + status = "disabled"; + + fixed-link { + speed = <1000>; + full-duplex; + }; + + mdio { + #address-cells = <1>; + #size-cells = <0>; + + switch10: switch@10 { + #address-cells = <1>; + #size-cells = <0>; + + compatible = "qca,ar9331-switch"; + reg = <0x10>; + resets = <&rst 8>; + reset-names = "switch"; + + interrupt-parent = <&miscintc>; + interrupts = <12>; + + interrupt-controller; + #interrupt-cells = <1>; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + switch_port0: port@0 { + reg = <0x0>; + label = "cpu"; + ethernet = <ð1>; + + phy-mode = "gmii"; + + fixed-link { + speed = <1000>; + full-duplex; + }; + }; + + switch_port1: port@1 { + reg = <0x1>; + phy-handle = <&phy_port0>; + phy-mode = "internal"; + + status = "disabled"; + }; + + switch_port2: port@2 { + reg = <0x2>; + phy-handle = <&phy_port1>; + phy-mode = "internal"; + + status = "disabled"; + }; + + switch_port3: port@3 { + reg = <0x3>; + phy-handle = <&phy_port2>; + phy-mode = "internal"; + + status = "disabled"; + }; + + switch_port4: port@4 { + reg = <0x4>; + phy-handle = <&phy_port3>; + phy-mode = "internal"; + + status = "disabled"; + }; + }; + + mdio { + #address-cells = <1>; + #size-cells = <0>; + + interrupt-parent = <&switch10>; + + phy_port0: phy@0 { + reg = <0x0>; + interrupts = <0>; + status = "disabled"; + }; + + phy_port1: phy@1 { + reg = <0x1>; + interrupts = <0>; + status = "disabled"; + }; + + phy_port2: phy@2 { + reg = <0x2>; + interrupts = <0>; + status = "disabled"; + }; + + phy_port3: phy@3 { + reg = <0x3>; + interrupts = <0>; + status = "disabled"; + }; + + phy_port4: phy@4 { + reg = <0x4>; + interrupts = <0>; + status = "disabled"; + }; + }; + }; + }; + }; -- cgit v1.2.3 From 65749009242bbf67d5be129b7b67c4f9bfd93360 Mon Sep 17 00:00:00 2001 From: Christian Hewitt Date: Thu, 23 Apr 2020 01:34:28 +0000 Subject: dt-bindings: net: bluetooth: Add device tree bindings for QCA9377 QCA9377 is a QCA ROME device frequently found in Android TV boxes. Signed-off-by: Christian Hewitt Signed-off-by: Marcel Holtmann --- Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt b/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt index badf597c0e58..f21ea24fb0ca 100644 --- a/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt +++ b/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt @@ -10,6 +10,7 @@ device the slave device is attached to. Required properties: - compatible: should contain one of the following: * "qcom,qca6174-bt" + * "qcom,qca9377-bt" * "qcom,wcn3990-bt" * "qcom,wcn3991-bt" * "qcom,wcn3998-bt" @@ -21,6 +22,10 @@ Optional properties for compatible string qcom,qca6174-bt: - clocks: clock provided to the controller (SUSCLK_32KHZ) - firmware-name: specify the name of nvm firmware to load +Optional properties for compatible string qcom,qca9377-bt: + + - max-speed: see Documentation/devicetree/bindings/serial/serial.yaml + Required properties for compatible string qcom,wcn399x-bt: - vddio-supply: VDD_IO supply regulator handle. -- cgit v1.2.3 From 4f80116d3df3b23ee4b83ea8557629e1799bc230 Mon Sep 17 00:00:00 2001 From: Roopa Prabhu Date: Mon, 27 Apr 2020 13:56:46 -0700 Subject: net: ipv4: add sysctl for nexthop api compatibility mode Current route nexthop API maintains user space compatibility with old route API by default. Dumps and netlink notifications support both new and old API format. In systems which have moved to the new API, this compatibility mode cancels some of the performance benefits provided by the new nexthop API. This patch adds new sysctl nexthop_compat_mode which is on by default but provides the ability to turn off compatibility mode allowing systems to run entirely with the new routing API. Old route API behaviour and support is not modified by this sysctl. Uses a single sysctl to cover both ipv4 and ipv6 following other sysctls. Covers dumps and delete notifications as suggested by David Ahern. Signed-off-by: Roopa Prabhu Reviewed-by: David Ahern Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 12 ++++++++++++ include/net/netns/ipv4.h | 2 ++ net/ipv4/af_inet.c | 1 + net/ipv4/fib_semantics.c | 3 +++ net/ipv4/nexthop.c | 5 +++-- net/ipv4/sysctl_net_ipv4.c | 9 +++++++++ net/ipv6/route.c | 3 ++- 7 files changed, 32 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 9375324aa8e1..5cdc37c34830 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1560,6 +1560,18 @@ skip_notify_on_dev_down - BOOLEAN on userspace caches to track link events and evict routes. Default: false (generate message) +nexthop_compat_mode - BOOLEAN + New nexthop API provides a means for managing nexthops independent of + prefixes. Backwards compatibilty with old route format is enabled by + default which means route dumps and notifications contain the new + nexthop attribute but also the full, expanded nexthop definition. + Further, updates or deletes of a nexthop configuration generate route + notifications for each fib entry using the nexthop. Once a system + understands the new API, this sysctl can be disabled to achieve full + performance benefits of the new API by disabling the nexthop expansion + and extraneous notifications. + Default: true (backward compat mode) + IPv6 Fragmentation: ip6frag_high_thresh - INTEGER diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index 154b8f01499b..5acdb4d414c4 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -111,6 +111,8 @@ struct netns_ipv4 { int sysctl_tcp_early_demux; int sysctl_udp_early_demux; + int sysctl_nexthop_compat_mode; + int sysctl_fwmark_reflect; int sysctl_tcp_fwmark_accept; #ifdef CONFIG_NET_L3_MASTER_DEV diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index c618e242490f..6177c4ba0037 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1835,6 +1835,7 @@ static __net_init int inet_init_net(struct net *net) net->ipv4.sysctl_ip_early_demux = 1; net->ipv4.sysctl_udp_early_demux = 1; net->ipv4.sysctl_tcp_early_demux = 1; + net->ipv4.sysctl_nexthop_compat_mode = 1; #ifdef CONFIG_SYSCTL net->ipv4.sysctl_ip_prot_sock = PROT_SOCK; #endif diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 55ca2e521828..e53871e4a097 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -1780,6 +1780,8 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event, goto nla_put_failure; if (nexthop_is_blackhole(fi->nh)) rtm->rtm_type = RTN_BLACKHOLE; + if (!fi->fib_net->ipv4.sysctl_nexthop_compat_mode) + goto offload; } if (nhs == 1) { @@ -1805,6 +1807,7 @@ int fib_dump_info(struct sk_buff *skb, u32 portid, u32 seq, int event, goto nla_put_failure; } +offload: if (fri->offload) rtm->rtm_flags |= RTM_F_OFFLOAD; if (fri->trap) diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 9999687ad6dc..3957364d556c 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -784,7 +784,8 @@ static void __remove_nexthop_fib(struct net *net, struct nexthop *nh) list_for_each_entry_safe(f6i, tmp, &nh->f6i_list, nh_list) { /* __ip6_del_rt does a release, so do a hold here */ fib6_info_hold(f6i); - ipv6_stub->ip6_del_rt(net, f6i, false); + ipv6_stub->ip6_del_rt(net, f6i, + !net->ipv4.sysctl_nexthop_compat_mode); } } @@ -1041,7 +1042,7 @@ out: if (!rc) { nh_base_seq_inc(net); nexthop_notify(RTM_NEWNEXTHOP, new_nh, &cfg->nlinfo); - if (replace_notify) + if (replace_notify && net->ipv4.sysctl_nexthop_compat_mode) nexthop_replace_notify(net, new_nh, &cfg->nlinfo); } diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 81b267e990a1..95ad71e76cc3 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -710,6 +710,15 @@ static struct ctl_table ipv4_net_table[] = { .mode = 0644, .proc_handler = proc_tcp_early_demux }, + { + .procname = "nexthop_compat_mode", + .data = &init_net.ipv4.sysctl_nexthop_compat_mode, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, { .procname = "ip_default_ttl", .data = &init_net.ipv4.sysctl_ip_default_ttl, diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 486c36a14f24..803212aae4ca 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -5557,7 +5557,8 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb, if (nexthop_is_blackhole(rt->nh)) rtm->rtm_type = RTN_BLACKHOLE; - if (rt6_fill_node_nexthop(skb, rt->nh, &nh_flags) < 0) + if (net->ipv4.sysctl_nexthop_compat_mode && + rt6_fill_node_nexthop(skb, rt->nh, &nh_flags) < 0) goto nla_put_failure; rtm->rtm_flags |= nh_flags; -- cgit v1.2.3 From 74bc7c97fa88ae334752e7b45702d23813df8873 Mon Sep 17 00:00:00 2001 From: Jakub Kicinski Date: Mon, 27 Apr 2020 18:03:50 -0700 Subject: kselftest: add fixture variants Allow users to build parameterized variants of fixtures. If fixtures want variants, they call FIXTURE_VARIANT() to declare the structure to fill for each variant. Each fixture will be re-run for each of the variants defined by calling FIXTURE_VARIANT_ADD() with the differing parameters initializing the structure. Since tests are being re-run, additional initialization (steps, no_print) is also added. Signed-off-by: Jakub Kicinski Acked-by: Kees Cook Signed-off-by: David S. Miller --- Documentation/dev-tools/kselftest.rst | 3 +- tools/testing/selftests/kselftest_harness.h | 148 +++++++++++++++++++++++----- 2 files changed, 124 insertions(+), 27 deletions(-) (limited to 'Documentation') diff --git a/Documentation/dev-tools/kselftest.rst b/Documentation/dev-tools/kselftest.rst index 61ae13c44f91..5d1f56fcd2e7 100644 --- a/Documentation/dev-tools/kselftest.rst +++ b/Documentation/dev-tools/kselftest.rst @@ -301,7 +301,8 @@ Helpers .. kernel-doc:: tools/testing/selftests/kselftest_harness.h :functions: TH_LOG TEST TEST_SIGNAL FIXTURE FIXTURE_DATA FIXTURE_SETUP - FIXTURE_TEARDOWN TEST_F TEST_HARNESS_MAIN + FIXTURE_TEARDOWN TEST_F TEST_HARNESS_MAIN FIXTURE_VARIANT + FIXTURE_VARIANT_ADD Operators --------- diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h index fa7185e45472..c9f03ef93338 100644 --- a/tools/testing/selftests/kselftest_harness.h +++ b/tools/testing/selftests/kselftest_harness.h @@ -168,9 +168,15 @@ #define __TEST_IMPL(test_name, _signal) \ static void test_name(struct __test_metadata *_metadata); \ + static inline void wrapper_##test_name( \ + struct __test_metadata *_metadata, \ + struct __fixture_variant_metadata *variant) \ + { \ + test_name(_metadata); \ + } \ static struct __test_metadata _##test_name##_object = \ { .name = #test_name, \ - .fn = &test_name, \ + .fn = &wrapper_##test_name, \ .fixture = &_fixture_global, \ .termsig = _signal, \ .timeout = TEST_TIMEOUT_DEFAULT, }; \ @@ -214,6 +220,7 @@ * populated and cleaned up using FIXTURE_SETUP() and FIXTURE_TEARDOWN(). */ #define FIXTURE(fixture_name) \ + FIXTURE_VARIANT(fixture_name); \ static struct __fixture_metadata _##fixture_name##_fixture_object = \ { .name = #fixture_name, }; \ static void __attribute__((constructor)) \ @@ -245,7 +252,10 @@ #define FIXTURE_SETUP(fixture_name) \ void fixture_name##_setup( \ struct __test_metadata __attribute__((unused)) *_metadata, \ - FIXTURE_DATA(fixture_name) __attribute__((unused)) *self) + FIXTURE_DATA(fixture_name) __attribute__((unused)) *self, \ + const FIXTURE_VARIANT(fixture_name) \ + __attribute__((unused)) *variant) + /** * FIXTURE_TEARDOWN(fixture_name) * *_metadata* is included so that EXPECT_* and ASSERT_* work correctly. @@ -267,6 +277,59 @@ struct __test_metadata __attribute__((unused)) *_metadata, \ FIXTURE_DATA(fixture_name) __attribute__((unused)) *self) +/** + * FIXTURE_VARIANT(fixture_name) - Optionally called once per fixture + * to declare fixture variant + * + * @fixture_name: fixture name + * + * .. code-block:: c + * + * FIXTURE_VARIANT(datatype name) { + * type property1; + * ... + * }; + * + * Defines type of constant parameters provided to FIXTURE_SETUP() and TEST_F() + * as *variant*. Variants allow the same tests to be run with different + * arguments. + */ +#define FIXTURE_VARIANT(fixture_name) struct _fixture_variant_##fixture_name + +/** + * FIXTURE_VARIANT_ADD(fixture_name, variant_name) - Called once per fixture + * variant to setup and register the data + * + * @fixture_name: fixture name + * @variant_name: name of the parameter set + * + * .. code-block:: c + * + * FIXTURE_ADD(datatype name) { + * .property1 = val1; + * ... + * }; + * + * Defines a variant of the test fixture, provided to FIXTURE_SETUP() and + * TEST_F() as *variant*. Tests of each fixture will be run once for each + * variant. + */ +#define FIXTURE_VARIANT_ADD(fixture_name, variant_name) \ + extern FIXTURE_VARIANT(fixture_name) \ + _##fixture_name##_##variant_name##_variant; \ + static struct __fixture_variant_metadata \ + _##fixture_name##_##variant_name##_object = \ + { .name = #variant_name, \ + .data = &_##fixture_name##_##variant_name##_variant}; \ + static void __attribute__((constructor)) \ + _register_##fixture_name##_##variant_name(void) \ + { \ + __register_fixture_variant(&_##fixture_name##_fixture_object, \ + &_##fixture_name##_##variant_name##_object); \ + } \ + FIXTURE_VARIANT(fixture_name) \ + _##fixture_name##_##variant_name##_variant = + /** * TEST_F(fixture_name, test_name) - Emits test registration and helpers for * fixture-based test cases @@ -297,18 +360,20 @@ #define __TEST_F_IMPL(fixture_name, test_name, signal, tmout) \ static void fixture_name##_##test_name( \ struct __test_metadata *_metadata, \ - FIXTURE_DATA(fixture_name) *self); \ + FIXTURE_DATA(fixture_name) *self, \ + const FIXTURE_VARIANT(fixture_name) *variant); \ static inline void wrapper_##fixture_name##_##test_name( \ - struct __test_metadata *_metadata) \ + struct __test_metadata *_metadata, \ + struct __fixture_variant_metadata *variant) \ { \ /* fixture data is alloced, setup, and torn down per call. */ \ FIXTURE_DATA(fixture_name) self; \ memset(&self, 0, sizeof(FIXTURE_DATA(fixture_name))); \ - fixture_name##_setup(_metadata, &self); \ + fixture_name##_setup(_metadata, &self, variant->data); \ /* Let setup failure terminate early. */ \ if (!_metadata->passed) \ return; \ - fixture_name##_##test_name(_metadata, &self); \ + fixture_name##_##test_name(_metadata, &self, variant->data); \ fixture_name##_teardown(_metadata, &self); \ } \ static struct __test_metadata \ @@ -326,7 +391,9 @@ } \ static void fixture_name##_##test_name( \ struct __test_metadata __attribute__((unused)) *_metadata, \ - FIXTURE_DATA(fixture_name) __attribute__((unused)) *self) + FIXTURE_DATA(fixture_name) __attribute__((unused)) *self, \ + const FIXTURE_VARIANT(fixture_name) \ + __attribute__((unused)) *variant) /** * TEST_HARNESS_MAIN - Simple wrapper to run the test harness @@ -660,11 +727,13 @@ } struct __test_metadata; +struct __fixture_variant_metadata; /* Contains all the information about a fixture. */ struct __fixture_metadata { const char *name; struct __test_metadata *tests; + struct __fixture_variant_metadata *variant; struct __fixture_metadata *prev, *next; } _fixture_global __attribute__((unused)) = { .name = "global", @@ -672,7 +741,6 @@ struct __fixture_metadata { }; static struct __fixture_metadata *__fixture_list = &_fixture_global; -static unsigned int __fixture_count; static int __constructor_order; #define _CONSTRUCTOR_ORDER_FORWARD 1 @@ -680,14 +748,27 @@ static int __constructor_order; static inline void __register_fixture(struct __fixture_metadata *f) { - __fixture_count++; __LIST_APPEND(__fixture_list, f); } +struct __fixture_variant_metadata { + const char *name; + const void *data; + struct __fixture_variant_metadata *prev, *next; +}; + +static inline void +__register_fixture_variant(struct __fixture_metadata *f, + struct __fixture_variant_metadata *variant) +{ + __LIST_APPEND(f->variant, variant); +} + /* Contains all the information for test execution and status checking. */ struct __test_metadata { const char *name; - void (*fn)(struct __test_metadata *); + void (*fn)(struct __test_metadata *, + struct __fixture_variant_metadata *); pid_t pid; /* pid of test when being run */ struct __fixture_metadata *fixture; int termsig; @@ -700,9 +781,6 @@ struct __test_metadata { struct __test_metadata *prev, *next; }; -/* Storage for the (global) tests to be run. */ -static unsigned int __test_count; - /* * Since constructors are called in reverse order, reverse the test * list so tests are run in source declaration order. @@ -714,7 +792,6 @@ static unsigned int __test_count; */ static inline void __register_test(struct __test_metadata *t) { - __test_count++; __LIST_APPEND(t->fixture->tests, t); } @@ -822,46 +899,65 @@ void __wait_for_test(struct __test_metadata *t) } void __run_test(struct __fixture_metadata *f, + struct __fixture_variant_metadata *variant, struct __test_metadata *t) { + /* reset test struct */ t->passed = 1; t->trigger = 0; - printf("[ RUN ] %s.%s\n", f->name, t->name); + t->step = 0; + t->no_print = 0; + + printf("[ RUN ] %s%s%s.%s\n", + f->name, variant->name[0] ? "." : "", variant->name, t->name); t->pid = fork(); if (t->pid < 0) { printf("ERROR SPAWNING TEST CHILD\n"); t->passed = 0; } else if (t->pid == 0) { - t->fn(t); + t->fn(t, variant); /* return the step that failed or 0 */ _exit(t->passed ? 0 : t->step); } else { __wait_for_test(t); } - printf("[ %4s ] %s.%s\n", (t->passed ? "OK" : "FAIL"), - f->name, t->name); + printf("[ %4s ] %s%s%s.%s\n", (t->passed ? "OK" : "FAIL"), + f->name, variant->name[0] ? "." : "", variant->name, t->name); } static int test_harness_run(int __attribute__((unused)) argc, char __attribute__((unused)) **argv) { + struct __fixture_variant_metadata no_variant = { .name = "", }; + struct __fixture_variant_metadata *v; struct __fixture_metadata *f; struct __test_metadata *t; int ret = 0; + unsigned int case_count = 0, test_count = 0; unsigned int count = 0; unsigned int pass_count = 0; + for (f = __fixture_list; f; f = f->next) { + for (v = f->variant ?: &no_variant; v; v = v->next) { + case_count++; + for (t = f->tests; t; t = t->next) + test_count++; + } + } + /* TODO(wad) add optional arguments similar to gtest. */ printf("[==========] Running %u tests from %u test cases.\n", - __test_count, __fixture_count + 1); + test_count, case_count); for (f = __fixture_list; f; f = f->next) { - for (t = f->tests; t; t = t->next) { - count++; - __run_test(f, t); - if (t->passed) - pass_count++; - else - ret = 1; + for (v = f->variant ?: &no_variant; v; v = v->next) { + for (t = f->tests; t; t = t->next) { + count++; + __run_test(f, v, t); + if (t->passed) + pass_count++; + else + ret = 1; + } } } printf("[==========] %u / %u tests passed.\n", pass_count, count); -- cgit v1.2.3 From da50d57abd7ecaba151600e726ccb944e7ddf81a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:16 +0200 Subject: docs: networking: convert caif files to ReST There are two text files for caif, plus one already converted file. Convert the two remaining ones to ReST, create a new index.rst file for CAIF, adding it to the main networking documentation index. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/caif/Linux-CAIF.txt | 175 -------------------- Documentation/networking/caif/caif.rst | 2 - Documentation/networking/caif/index.rst | 13 ++ Documentation/networking/caif/linux_caif.rst | 195 ++++++++++++++++++++++ Documentation/networking/caif/spi_porting.rst | 229 ++++++++++++++++++++++++++ Documentation/networking/caif/spi_porting.txt | 208 ----------------------- Documentation/networking/index.rst | 1 + drivers/net/caif/Kconfig | 2 +- 8 files changed, 439 insertions(+), 386 deletions(-) delete mode 100644 Documentation/networking/caif/Linux-CAIF.txt create mode 100644 Documentation/networking/caif/index.rst create mode 100644 Documentation/networking/caif/linux_caif.rst create mode 100644 Documentation/networking/caif/spi_porting.rst delete mode 100644 Documentation/networking/caif/spi_porting.txt (limited to 'Documentation') diff --git a/Documentation/networking/caif/Linux-CAIF.txt b/Documentation/networking/caif/Linux-CAIF.txt deleted file mode 100644 index 0aa4bd381bec..000000000000 --- a/Documentation/networking/caif/Linux-CAIF.txt +++ /dev/null @@ -1,175 +0,0 @@ -Linux CAIF -=========== -copyright (C) ST-Ericsson AB 2010 -Author: Sjur Brendeland/ sjur.brandeland@stericsson.com -License terms: GNU General Public License (GPL) version 2 - - -Introduction ------------- -CAIF is a MUX protocol used by ST-Ericsson cellular modems for -communication between Modem and host. The host processes can open virtual AT -channels, initiate GPRS Data connections, Video channels and Utility Channels. -The Utility Channels are general purpose pipes between modem and host. - -ST-Ericsson modems support a number of transports between modem -and host. Currently, UART and Loopback are available for Linux. - - -Architecture: ------------- -The implementation of CAIF is divided into: -* CAIF Socket Layer and GPRS IP Interface. -* CAIF Core Protocol Implementation -* CAIF Link Layer, implemented as NET devices. - - - RTNL - ! - ! +------+ +------+ - ! +------+! +------+! - ! ! IP !! !Socket!! - +-------> !interf!+ ! API !+ <- CAIF Client APIs - ! +------+ +------! - ! ! ! - ! +-----------+ - ! ! - ! +------+ <- CAIF Core Protocol - ! ! CAIF ! - ! ! Core ! - ! +------+ - ! +----------!---------+ - ! ! ! ! - ! +------+ +-----+ +------+ - +--> ! HSI ! ! TTY ! ! USB ! <- Link Layer (Net Devices) - +------+ +-----+ +------+ - - - -I M P L E M E N T A T I O N -=========================== - - -CAIF Core Protocol Layer -========================================= - -CAIF Core layer implements the CAIF protocol as defined by ST-Ericsson. -It implements the CAIF protocol stack in a layered approach, where -each layer described in the specification is implemented as a separate layer. -The architecture is inspired by the design patterns "Protocol Layer" and -"Protocol Packet". - -== CAIF structure == -The Core CAIF implementation contains: - - Simple implementation of CAIF. - - Layered architecture (a la Streams), each layer in the CAIF - specification is implemented in a separate c-file. - - Clients must call configuration function to add PHY layer. - - Clients must implement CAIF layer to consume/produce - CAIF payload with receive and transmit functions. - - Clients must call configuration function to add and connect the - Client layer. - - When receiving / transmitting CAIF Packets (cfpkt), ownership is passed - to the called function (except for framing layers' receive function) - -Layered Architecture --------------------- -The CAIF protocol can be divided into two parts: Support functions and Protocol -Implementation. The support functions include: - - - CFPKT CAIF Packet. Implementation of CAIF Protocol Packet. The - CAIF Packet has functions for creating, destroying and adding content - and for adding/extracting header and trailers to protocol packets. - -The CAIF Protocol implementation contains: - - - CFCNFG CAIF Configuration layer. Configures the CAIF Protocol - Stack and provides a Client interface for adding Link-Layer and - Driver interfaces on top of the CAIF Stack. - - - CFCTRL CAIF Control layer. Encodes and Decodes control messages - such as enumeration and channel setup. Also matches request and - response messages. - - - CFSERVL General CAIF Service Layer functionality; handles flow - control and remote shutdown requests. - - - CFVEI CAIF VEI layer. Handles CAIF AT Channels on VEI (Virtual - External Interface). This layer encodes/decodes VEI frames. - - - CFDGML CAIF Datagram layer. Handles CAIF Datagram layer (IP - traffic), encodes/decodes Datagram frames. - - - CFMUX CAIF Mux layer. Handles multiplexing between multiple - physical bearers and multiple channels such as VEI, Datagram, etc. - The MUX keeps track of the existing CAIF Channels and - Physical Instances and selects the appropriate instance based - on Channel-Id and Physical-ID. - - - CFFRML CAIF Framing layer. Handles Framing i.e. Frame length - and frame checksum. - - - CFSERL CAIF Serial layer. Handles concatenation/split of frames - into CAIF Frames with correct length. - - - - +---------+ - | Config | - | CFCNFG | - +---------+ - ! - +---------+ +---------+ +---------+ - | AT | | Control | | Datagram| - | CFVEIL | | CFCTRL | | CFDGML | - +---------+ +---------+ +---------+ - \_____________!______________/ - ! - +---------+ - | MUX | - | | - +---------+ - _____!_____ - / \ - +---------+ +---------+ - | CFFRML | | CFFRML | - | Framing | | Framing | - +---------+ +---------+ - ! ! - +---------+ +---------+ - | | | Serial | - | | | CFSERL | - +---------+ +---------+ - - -In this layered approach the following "rules" apply. - - All layers embed the same structure "struct cflayer" - - A layer does not depend on any other layer's private data. - - Layers are stacked by setting the pointers - layer->up , layer->dn - - In order to send data upwards, each layer should do - layer->up->receive(layer->up, packet); - - In order to send data downwards, each layer should do - layer->dn->transmit(layer->dn, packet); - - -CAIF Socket and IP interface -=========================== - -The IP interface and CAIF socket API are implemented on top of the -CAIF Core protocol. The IP Interface and CAIF socket have an instance of -'struct cflayer', just like the CAIF Core protocol stack. -Net device and Socket implement the 'receive()' function defined by -'struct cflayer', just like the rest of the CAIF stack. In this way, transmit and -receive of packets is handled as by the rest of the layers: the 'dn->transmit()' -function is called in order to transmit data. - -Configuration of Link Layer ---------------------------- -The Link Layer is implemented as Linux network devices (struct net_device). -Payload handling and registration is done using standard Linux mechanisms. - -The CAIF Protocol relies on a loss-less link layer without implementing -retransmission. This implies that packet drops must not happen. -Therefore a flow-control mechanism is implemented where the physical -interface can initiate flow stop for all CAIF Channels. diff --git a/Documentation/networking/caif/caif.rst b/Documentation/networking/caif/caif.rst index 07afc8063d4d..a07213030ccf 100644 --- a/Documentation/networking/caif/caif.rst +++ b/Documentation/networking/caif/caif.rst @@ -1,5 +1,3 @@ -:orphan: - .. SPDX-License-Identifier: GPL-2.0 .. include:: diff --git a/Documentation/networking/caif/index.rst b/Documentation/networking/caif/index.rst new file mode 100644 index 000000000000..86e5b7832ec3 --- /dev/null +++ b/Documentation/networking/caif/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +CAIF +==== + +Contents: + +.. toctree:: + :maxdepth: 2 + + linux_caif + caif + spi_porting diff --git a/Documentation/networking/caif/linux_caif.rst b/Documentation/networking/caif/linux_caif.rst new file mode 100644 index 000000000000..a0480862ab8c --- /dev/null +++ b/Documentation/networking/caif/linux_caif.rst @@ -0,0 +1,195 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +========== +Linux CAIF +========== + +Copyright |copy| ST-Ericsson AB 2010 + +:Author: Sjur Brendeland/ sjur.brandeland@stericsson.com +:License terms: GNU General Public License (GPL) version 2 + + +Introduction +============ + +CAIF is a MUX protocol used by ST-Ericsson cellular modems for +communication between Modem and host. The host processes can open virtual AT +channels, initiate GPRS Data connections, Video channels and Utility Channels. +The Utility Channels are general purpose pipes between modem and host. + +ST-Ericsson modems support a number of transports between modem +and host. Currently, UART and Loopback are available for Linux. + + +Architecture +============ + +The implementation of CAIF is divided into: + +* CAIF Socket Layer and GPRS IP Interface. +* CAIF Core Protocol Implementation +* CAIF Link Layer, implemented as NET devices. + +:: + + RTNL + ! + ! +------+ +------+ + ! +------+! +------+! + ! ! IP !! !Socket!! + +-------> !interf!+ ! API !+ <- CAIF Client APIs + ! +------+ +------! + ! ! ! + ! +-----------+ + ! ! + ! +------+ <- CAIF Core Protocol + ! ! CAIF ! + ! ! Core ! + ! +------+ + ! +----------!---------+ + ! ! ! ! + ! +------+ +-----+ +------+ + +--> ! HSI ! ! TTY ! ! USB ! <- Link Layer (Net Devices) + +------+ +-----+ +------+ + + + +Implementation +============== + + +CAIF Core Protocol Layer +------------------------ + +CAIF Core layer implements the CAIF protocol as defined by ST-Ericsson. +It implements the CAIF protocol stack in a layered approach, where +each layer described in the specification is implemented as a separate layer. +The architecture is inspired by the design patterns "Protocol Layer" and +"Protocol Packet". + +CAIF structure +^^^^^^^^^^^^^^ + +The Core CAIF implementation contains: + + - Simple implementation of CAIF. + - Layered architecture (a la Streams), each layer in the CAIF + specification is implemented in a separate c-file. + - Clients must call configuration function to add PHY layer. + - Clients must implement CAIF layer to consume/produce + CAIF payload with receive and transmit functions. + - Clients must call configuration function to add and connect the + Client layer. + - When receiving / transmitting CAIF Packets (cfpkt), ownership is passed + to the called function (except for framing layers' receive function) + +Layered Architecture +==================== + +The CAIF protocol can be divided into two parts: Support functions and Protocol +Implementation. The support functions include: + + - CFPKT CAIF Packet. Implementation of CAIF Protocol Packet. The + CAIF Packet has functions for creating, destroying and adding content + and for adding/extracting header and trailers to protocol packets. + +The CAIF Protocol implementation contains: + + - CFCNFG CAIF Configuration layer. Configures the CAIF Protocol + Stack and provides a Client interface for adding Link-Layer and + Driver interfaces on top of the CAIF Stack. + + - CFCTRL CAIF Control layer. Encodes and Decodes control messages + such as enumeration and channel setup. Also matches request and + response messages. + + - CFSERVL General CAIF Service Layer functionality; handles flow + control and remote shutdown requests. + + - CFVEI CAIF VEI layer. Handles CAIF AT Channels on VEI (Virtual + External Interface). This layer encodes/decodes VEI frames. + + - CFDGML CAIF Datagram layer. Handles CAIF Datagram layer (IP + traffic), encodes/decodes Datagram frames. + + - CFMUX CAIF Mux layer. Handles multiplexing between multiple + physical bearers and multiple channels such as VEI, Datagram, etc. + The MUX keeps track of the existing CAIF Channels and + Physical Instances and selects the appropriate instance based + on Channel-Id and Physical-ID. + + - CFFRML CAIF Framing layer. Handles Framing i.e. Frame length + and frame checksum. + + - CFSERL CAIF Serial layer. Handles concatenation/split of frames + into CAIF Frames with correct length. + +:: + + +---------+ + | Config | + | CFCNFG | + +---------+ + ! + +---------+ +---------+ +---------+ + | AT | | Control | | Datagram| + | CFVEIL | | CFCTRL | | CFDGML | + +---------+ +---------+ +---------+ + \_____________!______________/ + ! + +---------+ + | MUX | + | | + +---------+ + _____!_____ + / \ + +---------+ +---------+ + | CFFRML | | CFFRML | + | Framing | | Framing | + +---------+ +---------+ + ! ! + +---------+ +---------+ + | | | Serial | + | | | CFSERL | + +---------+ +---------+ + + +In this layered approach the following "rules" apply. + + - All layers embed the same structure "struct cflayer" + - A layer does not depend on any other layer's private data. + - Layers are stacked by setting the pointers:: + + layer->up , layer->dn + + - In order to send data upwards, each layer should do:: + + layer->up->receive(layer->up, packet); + + - In order to send data downwards, each layer should do:: + + layer->dn->transmit(layer->dn, packet); + + +CAIF Socket and IP interface +============================ + +The IP interface and CAIF socket API are implemented on top of the +CAIF Core protocol. The IP Interface and CAIF socket have an instance of +'struct cflayer', just like the CAIF Core protocol stack. +Net device and Socket implement the 'receive()' function defined by +'struct cflayer', just like the rest of the CAIF stack. In this way, transmit and +receive of packets is handled as by the rest of the layers: the 'dn->transmit()' +function is called in order to transmit data. + +Configuration of Link Layer +--------------------------- +The Link Layer is implemented as Linux network devices (struct net_device). +Payload handling and registration is done using standard Linux mechanisms. + +The CAIF Protocol relies on a loss-less link layer without implementing +retransmission. This implies that packet drops must not happen. +Therefore a flow-control mechanism is implemented where the physical +interface can initiate flow stop for all CAIF Channels. diff --git a/Documentation/networking/caif/spi_porting.rst b/Documentation/networking/caif/spi_porting.rst new file mode 100644 index 000000000000..d49f874b20ac --- /dev/null +++ b/Documentation/networking/caif/spi_porting.rst @@ -0,0 +1,229 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +CAIF SPI porting +================ + +CAIF SPI basics +=============== + +Running CAIF over SPI needs some extra setup, owing to the nature of SPI. +Two extra GPIOs have been added in order to negotiate the transfers +between the master and the slave. The minimum requirement for running +CAIF over SPI is a SPI slave chip and two GPIOs (more details below). +Please note that running as a slave implies that you need to keep up +with the master clock. An overrun or underrun event is fatal. + +CAIF SPI framework +================== + +To make porting as easy as possible, the CAIF SPI has been divided in +two parts. The first part (called the interface part) deals with all +generic functionality such as length framing, SPI frame negotiation +and SPI frame delivery and transmission. The other part is the CAIF +SPI slave device part, which is the module that you have to write if +you want to run SPI CAIF on a new hardware. This part takes care of +the physical hardware, both with regard to SPI and to GPIOs. + +- Implementing a CAIF SPI device: + + - Functionality provided by the CAIF SPI slave device: + + In order to implement a SPI device you will, as a minimum, + need to implement the following + functions: + + :: + + int (*init_xfer) (struct cfspi_xfer * xfer, struct cfspi_dev *dev): + + This function is called by the CAIF SPI interface to give + you a chance to set up your hardware to be ready to receive + a stream of data from the master. The xfer structure contains + both physical and logical addresses, as well as the total length + of the transfer in both directions.The dev parameter can be used + to map to different CAIF SPI slave devices. + + :: + + void (*sig_xfer) (bool xfer, struct cfspi_dev *dev): + + This function is called by the CAIF SPI interface when the output + (SPI_INT) GPIO needs to change state. The boolean value of the xfer + variable indicates whether the GPIO should be asserted (HIGH) or + deasserted (LOW). The dev parameter can be used to map to different CAIF + SPI slave devices. + + - Functionality provided by the CAIF SPI interface: + + :: + + void (*ss_cb) (bool assert, struct cfspi_ifc *ifc); + + This function is called by the CAIF SPI slave device in order to + signal a change of state of the input GPIO (SS) to the interface. + Only active edges are mandatory to be reported. + This function can be called from IRQ context (recommended in order + not to introduce latency). The ifc parameter should be the pointer + returned from the platform probe function in the SPI device structure. + + :: + + void (*xfer_done_cb) (struct cfspi_ifc *ifc); + + This function is called by the CAIF SPI slave device in order to + report that a transfer is completed. This function should only be + called once both the transmission and the reception are completed. + This function can be called from IRQ context (recommended in order + not to introduce latency). The ifc parameter should be the pointer + returned from the platform probe function in the SPI device structure. + + - Connecting the bits and pieces: + + - Filling in the SPI slave device structure: + + Connect the necessary callback functions. + + Indicate clock speed (used to calculate toggle delays). + + Chose a suitable name (helps debugging if you use several CAIF + SPI slave devices). + + Assign your private data (can be used to map to your + structure). + + - Filling in the SPI slave platform device structure: + + Add name of driver to connect to ("cfspi_sspi"). + + Assign the SPI slave device structure as platform data. + +Padding +======= + +In order to optimize throughput, a number of SPI padding options are provided. +Padding can be enabled independently for uplink and downlink transfers. +Padding can be enabled for the head, the tail and for the total frame size. +The padding needs to be correctly configured on both sides of the link. +The padding can be changed via module parameters in cfspi_sspi.c or via +the sysfs directory of the cfspi_sspi driver (before device registration). + +- CAIF SPI device template:: + + /* + * Copyright (C) ST-Ericsson AB 2010 + * Author: Daniel Martensson / Daniel.Martensson@stericsson.com + * License terms: GNU General Public License (GPL), version 2. + * + */ + + #include + #include + #include + #include + #include + #include + #include + + MODULE_LICENSE("GPL"); + + struct sspi_struct { + struct cfspi_dev sdev; + struct cfspi_xfer *xfer; + }; + + static struct sspi_struct slave; + static struct platform_device slave_device; + + static irqreturn_t sspi_irq(int irq, void *arg) + { + /* You only need to trigger on an edge to the active state of the + * SS signal. Once a edge is detected, the ss_cb() function should be + * called with the parameter assert set to true. It is OK + * (and even advised) to call the ss_cb() function in IRQ context in + * order not to add any delay. */ + + return IRQ_HANDLED; + } + + static void sspi_complete(void *context) + { + /* Normally the DMA or the SPI framework will call you back + * in something similar to this. The only thing you need to + * do is to call the xfer_done_cb() function, providing the pointer + * to the CAIF SPI interface. It is OK to call this function + * from IRQ context. */ + } + + static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev) + { + /* Store transfer info. For a normal implementation you should + * set up your DMA here and make sure that you are ready to + * receive the data from the master SPI. */ + + struct sspi_struct *sspi = (struct sspi_struct *)dev->priv; + + sspi->xfer = xfer; + + return 0; + } + + void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev) + { + /* If xfer is true then you should assert the SPI_INT to indicate to + * the master that you are ready to receive the data from the master + * SPI. If xfer is false then you should de-assert SPI_INT to indicate + * that the transfer is done. + */ + + struct sspi_struct *sspi = (struct sspi_struct *)dev->priv; + } + + static void sspi_release(struct device *dev) + { + /* + * Here you should release your SPI device resources. + */ + } + + static int __init sspi_init(void) + { + /* Here you should initialize your SPI device by providing the + * necessary functions, clock speed, name and private data. Once + * done, you can register your device with the + * platform_device_register() function. This function will return + * with the CAIF SPI interface initialized. This is probably also + * the place where you should set up your GPIOs, interrupts and SPI + * resources. */ + + int res = 0; + + /* Initialize slave device. */ + slave.sdev.init_xfer = sspi_init_xfer; + slave.sdev.sig_xfer = sspi_sig_xfer; + slave.sdev.clk_mhz = 13; + slave.sdev.priv = &slave; + slave.sdev.name = "spi_sspi"; + slave_device.dev.release = sspi_release; + + /* Initialize platform device. */ + slave_device.name = "cfspi_sspi"; + slave_device.dev.platform_data = &slave.sdev; + + /* Register platform device. */ + res = platform_device_register(&slave_device); + if (res) { + printk(KERN_WARNING "sspi_init: failed to register dev.\n"); + return -ENODEV; + } + + return res; + } + + static void __exit sspi_exit(void) + { + platform_device_del(&slave_device); + } + + module_init(sspi_init); + module_exit(sspi_exit); diff --git a/Documentation/networking/caif/spi_porting.txt b/Documentation/networking/caif/spi_porting.txt deleted file mode 100644 index 9efd0687dc4c..000000000000 --- a/Documentation/networking/caif/spi_porting.txt +++ /dev/null @@ -1,208 +0,0 @@ -- CAIF SPI porting - - -- CAIF SPI basics: - -Running CAIF over SPI needs some extra setup, owing to the nature of SPI. -Two extra GPIOs have been added in order to negotiate the transfers - between the master and the slave. The minimum requirement for running -CAIF over SPI is a SPI slave chip and two GPIOs (more details below). -Please note that running as a slave implies that you need to keep up -with the master clock. An overrun or underrun event is fatal. - -- CAIF SPI framework: - -To make porting as easy as possible, the CAIF SPI has been divided in -two parts. The first part (called the interface part) deals with all -generic functionality such as length framing, SPI frame negotiation -and SPI frame delivery and transmission. The other part is the CAIF -SPI slave device part, which is the module that you have to write if -you want to run SPI CAIF on a new hardware. This part takes care of -the physical hardware, both with regard to SPI and to GPIOs. - -- Implementing a CAIF SPI device: - - - Functionality provided by the CAIF SPI slave device: - - In order to implement a SPI device you will, as a minimum, - need to implement the following - functions: - - int (*init_xfer) (struct cfspi_xfer * xfer, struct cfspi_dev *dev): - - This function is called by the CAIF SPI interface to give - you a chance to set up your hardware to be ready to receive - a stream of data from the master. The xfer structure contains - both physical and logical addresses, as well as the total length - of the transfer in both directions.The dev parameter can be used - to map to different CAIF SPI slave devices. - - void (*sig_xfer) (bool xfer, struct cfspi_dev *dev): - - This function is called by the CAIF SPI interface when the output - (SPI_INT) GPIO needs to change state. The boolean value of the xfer - variable indicates whether the GPIO should be asserted (HIGH) or - deasserted (LOW). The dev parameter can be used to map to different CAIF - SPI slave devices. - - - Functionality provided by the CAIF SPI interface: - - void (*ss_cb) (bool assert, struct cfspi_ifc *ifc); - - This function is called by the CAIF SPI slave device in order to - signal a change of state of the input GPIO (SS) to the interface. - Only active edges are mandatory to be reported. - This function can be called from IRQ context (recommended in order - not to introduce latency). The ifc parameter should be the pointer - returned from the platform probe function in the SPI device structure. - - void (*xfer_done_cb) (struct cfspi_ifc *ifc); - - This function is called by the CAIF SPI slave device in order to - report that a transfer is completed. This function should only be - called once both the transmission and the reception are completed. - This function can be called from IRQ context (recommended in order - not to introduce latency). The ifc parameter should be the pointer - returned from the platform probe function in the SPI device structure. - - - Connecting the bits and pieces: - - - Filling in the SPI slave device structure: - - Connect the necessary callback functions. - Indicate clock speed (used to calculate toggle delays). - Chose a suitable name (helps debugging if you use several CAIF - SPI slave devices). - Assign your private data (can be used to map to your structure). - - - Filling in the SPI slave platform device structure: - Add name of driver to connect to ("cfspi_sspi"). - Assign the SPI slave device structure as platform data. - -- Padding: - -In order to optimize throughput, a number of SPI padding options are provided. -Padding can be enabled independently for uplink and downlink transfers. -Padding can be enabled for the head, the tail and for the total frame size. -The padding needs to be correctly configured on both sides of the link. -The padding can be changed via module parameters in cfspi_sspi.c or via -the sysfs directory of the cfspi_sspi driver (before device registration). - -- CAIF SPI device template: - -/* - * Copyright (C) ST-Ericsson AB 2010 - * Author: Daniel Martensson / Daniel.Martensson@stericsson.com - * License terms: GNU General Public License (GPL), version 2. - * - */ - -#include -#include -#include -#include -#include -#include -#include - -MODULE_LICENSE("GPL"); - -struct sspi_struct { - struct cfspi_dev sdev; - struct cfspi_xfer *xfer; -}; - -static struct sspi_struct slave; -static struct platform_device slave_device; - -static irqreturn_t sspi_irq(int irq, void *arg) -{ - /* You only need to trigger on an edge to the active state of the - * SS signal. Once a edge is detected, the ss_cb() function should be - * called with the parameter assert set to true. It is OK - * (and even advised) to call the ss_cb() function in IRQ context in - * order not to add any delay. */ - - return IRQ_HANDLED; -} - -static void sspi_complete(void *context) -{ - /* Normally the DMA or the SPI framework will call you back - * in something similar to this. The only thing you need to - * do is to call the xfer_done_cb() function, providing the pointer - * to the CAIF SPI interface. It is OK to call this function - * from IRQ context. */ -} - -static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev) -{ - /* Store transfer info. For a normal implementation you should - * set up your DMA here and make sure that you are ready to - * receive the data from the master SPI. */ - - struct sspi_struct *sspi = (struct sspi_struct *)dev->priv; - - sspi->xfer = xfer; - - return 0; -} - -void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev) -{ - /* If xfer is true then you should assert the SPI_INT to indicate to - * the master that you are ready to receive the data from the master - * SPI. If xfer is false then you should de-assert SPI_INT to indicate - * that the transfer is done. - */ - - struct sspi_struct *sspi = (struct sspi_struct *)dev->priv; -} - -static void sspi_release(struct device *dev) -{ - /* - * Here you should release your SPI device resources. - */ -} - -static int __init sspi_init(void) -{ - /* Here you should initialize your SPI device by providing the - * necessary functions, clock speed, name and private data. Once - * done, you can register your device with the - * platform_device_register() function. This function will return - * with the CAIF SPI interface initialized. This is probably also - * the place where you should set up your GPIOs, interrupts and SPI - * resources. */ - - int res = 0; - - /* Initialize slave device. */ - slave.sdev.init_xfer = sspi_init_xfer; - slave.sdev.sig_xfer = sspi_sig_xfer; - slave.sdev.clk_mhz = 13; - slave.sdev.priv = &slave; - slave.sdev.name = "spi_sspi"; - slave_device.dev.release = sspi_release; - - /* Initialize platform device. */ - slave_device.name = "cfspi_sspi"; - slave_device.dev.platform_data = &slave.sdev; - - /* Register platform device. */ - res = platform_device_register(&slave_device); - if (res) { - printk(KERN_WARNING "sspi_init: failed to register dev.\n"); - return -ENODEV; - } - - return res; -} - -static void __exit sspi_exit(void) -{ - platform_device_del(&slave_device); -} - -module_init(sspi_init); -module_exit(sspi_exit); diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 6538ede29661..5b3421ec25ec 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -15,6 +15,7 @@ Contents: device_drivers/index dsa/index devlink/index + caif/index ethtool-netlink ieee802154 j1939 diff --git a/drivers/net/caif/Kconfig b/drivers/net/caif/Kconfig index 661c25eb1c46..1538ad194cf4 100644 --- a/drivers/net/caif/Kconfig +++ b/drivers/net/caif/Kconfig @@ -28,7 +28,7 @@ config CAIF_SPI_SLAVE The CAIF Link layer SPI Protocol driver for Slave SPI interface. This driver implements a platform driver to accommodate for a platform specific SPI device. A sample CAIF SPI Platform device is - provided in . + provided in . config CAIF_SPI_SYNC bool "Next command and length in start of frame" -- cgit v1.2.3 From a434aaba17f56c0a25edff4104dd5f9d5b3ceba2 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:17 +0200 Subject: docs: networking: convert 6pack.txt to ReST - add SPDX header; - use title markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/6pack.rst | 191 +++++++++++++++++++++++++++++++++++++ Documentation/networking/6pack.txt | 175 --------------------------------- Documentation/networking/index.rst | 1 + drivers/net/hamradio/Kconfig | 2 +- 4 files changed, 193 insertions(+), 176 deletions(-) create mode 100644 Documentation/networking/6pack.rst delete mode 100644 Documentation/networking/6pack.txt (limited to 'Documentation') diff --git a/Documentation/networking/6pack.rst b/Documentation/networking/6pack.rst new file mode 100644 index 000000000000..bc5bf1f1a98f --- /dev/null +++ b/Documentation/networking/6pack.rst @@ -0,0 +1,191 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +6pack Protocol +============== + +This is the 6pack-mini-HOWTO, written by + +Andreas Könsgen DG3KQ + +:Internet: ajk@comnets.uni-bremen.de +:AMPR-net: dg3kq@db0pra.ampr.org +:AX.25: dg3kq@db0ach.#nrw.deu.eu + +Last update: April 7, 1998 + +1. What is 6pack, and what are the advantages to KISS? +====================================================== + +6pack is a transmission protocol for data exchange between the PC and +the TNC over a serial line. It can be used as an alternative to KISS. + +6pack has two major advantages: + +- The PC is given full control over the radio + channel. Special control data is exchanged between the PC and the TNC so + that the PC knows at any time if the TNC is receiving data, if a TNC + buffer underrun or overrun has occurred, if the PTT is + set and so on. This control data is processed at a higher priority than + normal data, so a data stream can be interrupted at any time to issue an + important event. This helps to improve the channel access and timing + algorithms as everything is computed in the PC. It would even be possible + to experiment with something completely different from the known CSMA and + DAMA channel access methods. + This kind of real-time control is especially important to supply several + TNCs that are connected between each other and the PC by a daisy chain + (however, this feature is not supported yet by the Linux 6pack driver). + +- Each packet transferred over the serial line is supplied with a checksum, + so it is easy to detect errors due to problems on the serial line. + Received packets that are corrupt are not passed on to the AX.25 layer. + Damaged packets that the TNC has received from the PC are not transmitted. + +More details about 6pack are described in the file 6pack.ps that is located +in the doc directory of the AX.25 utilities package. + +2. Who has developed the 6pack protocol? +======================================== + +The 6pack protocol has been developed by Ekki Plicht DF4OR, Henning Rech +DF9IC and Gunter Jost DK7WJ. A driver for 6pack, written by Gunter Jost and +Matthias Welwarsky DG2FEF, comes along with the PC version of FlexNet. +They have also written a firmware for TNCs to perform the 6pack +protocol (see section 4 below). + +3. Where can I get the latest version of 6pack for LinuX? +========================================================= + +At the moment, the 6pack stuff can obtained via anonymous ftp from +db0bm.automation.fh-aachen.de. In the directory /incoming/dg3kq, +there is a file named 6pack.tgz. + +4. Preparing the TNC for 6pack operation +======================================== + +To be able to use 6pack, a special firmware for the TNC is needed. The EPROM +of a newly bought TNC does not contain 6pack, so you will have to +program an EPROM yourself. The image file for 6pack EPROMs should be +available on any packet radio box where PC/FlexNet can be found. The name of +the file is 6pack.bin. This file is copyrighted and maintained by the FlexNet +team. It can be used under the terms of the license that comes along +with PC/FlexNet. Please do not ask me about the internals of this file as I +don't know anything about it. I used a textual description of the 6pack +protocol to program the Linux driver. + +TNCs contain a 64kByte EPROM, the lower half of which is used for +the firmware/KISS. The upper half is either empty or is sometimes +programmed with software called TAPR. In the latter case, the TNC +is supplied with a DIP switch so you can easily change between the +two systems. When programming a new EPROM, one of the systems is replaced +by 6pack. It is useful to replace TAPR, as this software is rarely used +nowadays. If your TNC is not equipped with the switch mentioned above, you +can build in one yourself that switches over the highest address pin +of the EPROM between HIGH and LOW level. After having inserted the new EPROM +and switched to 6pack, apply power to the TNC for a first test. The connect +and the status LED are lit for about a second if the firmware initialises +the TNC correctly. + +5. Building and installing the 6pack driver +=========================================== + +The driver has been tested with kernel version 2.1.90. Use with older +kernels may lead to a compilation error because the interface to a kernel +function has been changed in the 2.1.8x kernels. + +How to turn on 6pack support: +============================= + +- In the linux kernel configuration program, select the code maturity level + options menu and turn on the prompting for development drivers. + +- Select the amateur radio support menu and turn on the serial port 6pack + driver. + +- Compile and install the kernel and the modules. + +To use the driver, the kissattach program delivered with the AX.25 utilities +has to be modified. + +- Do a cd to the directory that holds the kissattach sources. Edit the + kissattach.c file. At the top, insert the following lines:: + + #ifndef N_6PACK + #define N_6PACK (N_AX25+1) + #endif + + Then find the line: + + int disc = N_AX25; + + and replace N_AX25 by N_6PACK. + +- Recompile kissattach. Rename it to spattach to avoid confusions. + +Installing the driver: +---------------------- + +- Do an insmod 6pack. Look at your /var/log/messages file to check if the + module has printed its initialization message. + +- Do a spattach as you would launch kissattach when starting a KISS port. + Check if the kernel prints the message '6pack: TNC found'. + +- From here, everything should work as if you were setting up a KISS port. + The only difference is that the network device that represents + the 6pack port is called sp instead of sl or ax. So, sp0 would be the + first 6pack port. + +Although the driver has been tested on various platforms, I still declare it +ALPHA. BE CAREFUL! Sync your disks before insmoding the 6pack module +and spattaching. Watch out if your computer behaves strangely. Read section +6 of this file about known problems. + +Note that the connect and status LEDs of the TNC are controlled in a +different way than they are when the TNC is used with PC/FlexNet. When using +FlexNet, the connect LED is on if there is a connection; the status LED is +on if there is data in the buffer of the PC's AX.25 engine that has to be +transmitted. Under Linux, the 6pack layer is beyond the AX.25 layer, +so the 6pack driver doesn't know anything about connects or data that +has not yet been transmitted. Therefore the LEDs are controlled +as they are in KISS mode: The connect LED is turned on if data is transferred +from the PC to the TNC over the serial line, the status LED if data is +sent to the PC. + +6. Known problems +================= + +When testing the driver with 2.0.3x kernels and +operating with data rates on the radio channel of 9600 Baud or higher, +the driver may, on certain systems, sometimes print the message '6pack: +bad checksum', which is due to data loss if the other station sends two +or more subsequent packets. I have been told that this is due to a problem +with the serial driver of 2.0.3x kernels. I don't know yet if the problem +still exists with 2.1.x kernels, as I have heard that the serial driver +code has been changed with 2.1.x. + +When shutting down the sp interface with ifconfig, the kernel crashes if +there is still an AX.25 connection left over which an IP connection was +running, even if that IP connection is already closed. The problem does not +occur when there is a bare AX.25 connection still running. I don't know if +this is a problem of the 6pack driver or something else in the kernel. + +The driver has been tested as a module, not yet as a kernel-builtin driver. + +The 6pack protocol supports daisy-chaining of TNCs in a token ring, which is +connected to one serial port of the PC. This feature is not implemented +and at least at the moment I won't be able to do it because I do not have +the opportunity to build a TNC daisy-chain and test it. + +Some of the comments in the source code are inaccurate. They are left from +the SLIP/KISS driver, from which the 6pack driver has been derived. +I haven't modified or removed them yet -- sorry! The code itself needs +some cleaning and optimizing. This will be done in a later release. + +If you encounter a bug or if you have a question or suggestion concerning the +driver, feel free to mail me, using the addresses given at the beginning of +this file. + +Have fun! + +Andreas diff --git a/Documentation/networking/6pack.txt b/Documentation/networking/6pack.txt deleted file mode 100644 index 8f339428fdf4..000000000000 --- a/Documentation/networking/6pack.txt +++ /dev/null @@ -1,175 +0,0 @@ -This is the 6pack-mini-HOWTO, written by - -Andreas Könsgen DG3KQ -Internet: ajk@comnets.uni-bremen.de -AMPR-net: dg3kq@db0pra.ampr.org -AX.25: dg3kq@db0ach.#nrw.deu.eu - -Last update: April 7, 1998 - -1. What is 6pack, and what are the advantages to KISS? - -6pack is a transmission protocol for data exchange between the PC and -the TNC over a serial line. It can be used as an alternative to KISS. - -6pack has two major advantages: -- The PC is given full control over the radio - channel. Special control data is exchanged between the PC and the TNC so - that the PC knows at any time if the TNC is receiving data, if a TNC - buffer underrun or overrun has occurred, if the PTT is - set and so on. This control data is processed at a higher priority than - normal data, so a data stream can be interrupted at any time to issue an - important event. This helps to improve the channel access and timing - algorithms as everything is computed in the PC. It would even be possible - to experiment with something completely different from the known CSMA and - DAMA channel access methods. - This kind of real-time control is especially important to supply several - TNCs that are connected between each other and the PC by a daisy chain - (however, this feature is not supported yet by the Linux 6pack driver). - -- Each packet transferred over the serial line is supplied with a checksum, - so it is easy to detect errors due to problems on the serial line. - Received packets that are corrupt are not passed on to the AX.25 layer. - Damaged packets that the TNC has received from the PC are not transmitted. - -More details about 6pack are described in the file 6pack.ps that is located -in the doc directory of the AX.25 utilities package. - -2. Who has developed the 6pack protocol? - -The 6pack protocol has been developed by Ekki Plicht DF4OR, Henning Rech -DF9IC and Gunter Jost DK7WJ. A driver for 6pack, written by Gunter Jost and -Matthias Welwarsky DG2FEF, comes along with the PC version of FlexNet. -They have also written a firmware for TNCs to perform the 6pack -protocol (see section 4 below). - -3. Where can I get the latest version of 6pack for LinuX? - -At the moment, the 6pack stuff can obtained via anonymous ftp from -db0bm.automation.fh-aachen.de. In the directory /incoming/dg3kq, -there is a file named 6pack.tgz. - -4. Preparing the TNC for 6pack operation - -To be able to use 6pack, a special firmware for the TNC is needed. The EPROM -of a newly bought TNC does not contain 6pack, so you will have to -program an EPROM yourself. The image file for 6pack EPROMs should be -available on any packet radio box where PC/FlexNet can be found. The name of -the file is 6pack.bin. This file is copyrighted and maintained by the FlexNet -team. It can be used under the terms of the license that comes along -with PC/FlexNet. Please do not ask me about the internals of this file as I -don't know anything about it. I used a textual description of the 6pack -protocol to program the Linux driver. - -TNCs contain a 64kByte EPROM, the lower half of which is used for -the firmware/KISS. The upper half is either empty or is sometimes -programmed with software called TAPR. In the latter case, the TNC -is supplied with a DIP switch so you can easily change between the -two systems. When programming a new EPROM, one of the systems is replaced -by 6pack. It is useful to replace TAPR, as this software is rarely used -nowadays. If your TNC is not equipped with the switch mentioned above, you -can build in one yourself that switches over the highest address pin -of the EPROM between HIGH and LOW level. After having inserted the new EPROM -and switched to 6pack, apply power to the TNC for a first test. The connect -and the status LED are lit for about a second if the firmware initialises -the TNC correctly. - -5. Building and installing the 6pack driver - -The driver has been tested with kernel version 2.1.90. Use with older -kernels may lead to a compilation error because the interface to a kernel -function has been changed in the 2.1.8x kernels. - -How to turn on 6pack support: - -- In the linux kernel configuration program, select the code maturity level - options menu and turn on the prompting for development drivers. - -- Select the amateur radio support menu and turn on the serial port 6pack - driver. - -- Compile and install the kernel and the modules. - -To use the driver, the kissattach program delivered with the AX.25 utilities -has to be modified. - -- Do a cd to the directory that holds the kissattach sources. Edit the - kissattach.c file. At the top, insert the following lines: - - #ifndef N_6PACK - #define N_6PACK (N_AX25+1) - #endif - - Then find the line - - int disc = N_AX25; - - and replace N_AX25 by N_6PACK. - -- Recompile kissattach. Rename it to spattach to avoid confusions. - -Installing the driver: - -- Do an insmod 6pack. Look at your /var/log/messages file to check if the - module has printed its initialization message. - -- Do a spattach as you would launch kissattach when starting a KISS port. - Check if the kernel prints the message '6pack: TNC found'. - -- From here, everything should work as if you were setting up a KISS port. - The only difference is that the network device that represents - the 6pack port is called sp instead of sl or ax. So, sp0 would be the - first 6pack port. - -Although the driver has been tested on various platforms, I still declare it -ALPHA. BE CAREFUL! Sync your disks before insmoding the 6pack module -and spattaching. Watch out if your computer behaves strangely. Read section -6 of this file about known problems. - -Note that the connect and status LEDs of the TNC are controlled in a -different way than they are when the TNC is used with PC/FlexNet. When using -FlexNet, the connect LED is on if there is a connection; the status LED is -on if there is data in the buffer of the PC's AX.25 engine that has to be -transmitted. Under Linux, the 6pack layer is beyond the AX.25 layer, -so the 6pack driver doesn't know anything about connects or data that -has not yet been transmitted. Therefore the LEDs are controlled -as they are in KISS mode: The connect LED is turned on if data is transferred -from the PC to the TNC over the serial line, the status LED if data is -sent to the PC. - -6. Known problems - -When testing the driver with 2.0.3x kernels and -operating with data rates on the radio channel of 9600 Baud or higher, -the driver may, on certain systems, sometimes print the message '6pack: -bad checksum', which is due to data loss if the other station sends two -or more subsequent packets. I have been told that this is due to a problem -with the serial driver of 2.0.3x kernels. I don't know yet if the problem -still exists with 2.1.x kernels, as I have heard that the serial driver -code has been changed with 2.1.x. - -When shutting down the sp interface with ifconfig, the kernel crashes if -there is still an AX.25 connection left over which an IP connection was -running, even if that IP connection is already closed. The problem does not -occur when there is a bare AX.25 connection still running. I don't know if -this is a problem of the 6pack driver or something else in the kernel. - -The driver has been tested as a module, not yet as a kernel-builtin driver. - -The 6pack protocol supports daisy-chaining of TNCs in a token ring, which is -connected to one serial port of the PC. This feature is not implemented -and at least at the moment I won't be able to do it because I do not have -the opportunity to build a TNC daisy-chain and test it. - -Some of the comments in the source code are inaccurate. They are left from -the SLIP/KISS driver, from which the 6pack driver has been derived. -I haven't modified or removed them yet -- sorry! The code itself needs -some cleaning and optimizing. This will be done in a later release. - -If you encounter a bug or if you have a question or suggestion concerning the -driver, feel free to mail me, using the addresses given at the beginning of -this file. - -Have fun! - -Andreas diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 5b3421ec25ec..dc37fc8d5bee 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -37,6 +37,7 @@ Contents: tls-offload nfc 6lowpan + 6pack .. only:: subproject and html diff --git a/drivers/net/hamradio/Kconfig b/drivers/net/hamradio/Kconfig index 8e05b5c31a77..bf306fed04cc 100644 --- a/drivers/net/hamradio/Kconfig +++ b/drivers/net/hamradio/Kconfig @@ -30,7 +30,7 @@ config 6PACK Note that this driver is still experimental and might cause problems. For details about the features and the usage of the - driver, read . + driver, read . To compile this driver as a module, choose M here: the module will be called 6pack. -- cgit v1.2.3 From 5a7f3132121bbcafd61f616170a08e511d675347 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:18 +0200 Subject: docs: networking: convert altera_tse.txt to ReST - add SPDX header; - use copyright symbol; - adjust titles and chapters, adding proper markups; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/altera_tse.rst | 286 ++++++++++++++++++++++++++++++++ Documentation/networking/altera_tse.txt | 263 ----------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 287 insertions(+), 263 deletions(-) create mode 100644 Documentation/networking/altera_tse.rst delete mode 100644 Documentation/networking/altera_tse.txt (limited to 'Documentation') diff --git a/Documentation/networking/altera_tse.rst b/Documentation/networking/altera_tse.rst new file mode 100644 index 000000000000..7a7040072e58 --- /dev/null +++ b/Documentation/networking/altera_tse.rst @@ -0,0 +1,286 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: + +======================================= +Altera Triple-Speed Ethernet MAC driver +======================================= + +Copyright |copy| 2008-2014 Altera Corporation + +This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers +using the SGDMA and MSGDMA soft DMA IP components. The driver uses the +platform bus to obtain component resources. The designs used to test this +driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board, +and tested with ARM and NIOS processor hosts separately. The anticipated use +cases are simple communications between an embedded system and an external peer +for status and simple configuration of the embedded system. + +For more information visit www.altera.com and www.rocketboards.org. Support +forums for the driver may be found on www.rocketboards.org, and a design used +to test this driver may be found there as well. Support is also available from +the maintainer of this driver, found in MAINTAINERS. + +The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP +components that can be assembled and built into an FPGA using the Altera +Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that +this driver was tested against. The sopc2dts tool is used to create the +device tree for the driver, and may be found at rocketboards.org. + +The driver probe function examines the device tree and determines if the +Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The +probe function then installs the appropriate set of DMA routines to +initialize, setup transmits, receives, and interrupt handling primitives for +the respective configurations. + +The SGDMA component is to be deprecated in the near future (over the next 1-2 +years as of this writing in early 2014) in favor of the MSGDMA component. +SGDMA support is included for existing designs and reference in case a +developer wishes to support their own soft DMA logic and driver support. Any +new designs should not use the SGDMA. + +The SGDMA supports only a single transmit or receive operation at a time, and +therefore will not perform as well compared to the MSGDMA soft IP. Please +visit www.altera.com for known, documented SGDMA errata. + +Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time. +Scatter-gather DMA will be added to a future maintenance update to this +driver. + +Jumbo frames are not supported at this time. + +The driver limits PHY operations to 10/100Mbps, and has not yet been fully +tested for 1Gbps. This support will be added in a future maintenance update. + +1. Kernel Configuration +======================= + +The kernel configuration option is ALTERA_TSE: + + Device Drivers ---> Network device support ---> Ethernet driver support ---> + Altera Triple-Speed Ethernet MAC support (ALTERA_TSE) + +2. Driver parameters list +========================= + + - debug: message level (0: no output, 16: all); + - dma_rx_num: Number of descriptors in the RX list (default is 64); + - dma_tx_num: Number of descriptors in the TX list (default is 64). + +3. Command line options +======================= + +Driver parameters can be also passed in command line by using:: + + altera_tse=dma_rx_num:128,dma_tx_num:512 + +4. Driver information and notes +=============================== + +4.1. Transmit process +--------------------- +When the driver's transmit routine is called by the kernel, it sets up a +transmit descriptor by calling the underlying DMA transmit routine (SGDMA or +MSGDMA), and initiates a transmit operation. Once the transmit is complete, an +interrupt is driven by the transmit DMA logic. The driver handles the transmit +completion in the context of the interrupt handling chain by recycling +resource required to send and track the requested transmit operation. + +4.2. Receive process +-------------------- +The driver will post receive buffers to the receive DMA logic during driver +initialization. Receive buffers may or may not be queued depending upon the +underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able +to queue receive buffers to the SGDMA receive logic). When a packet is +received, the DMA logic generates an interrupt. The driver handles a receive +interrupt by obtaining the DMA receive logic status, reaping receive +completions until no more receive completions are available. + +4.3. Interrupt Mitigation +------------------------- +The driver is able to mitigate the number of its DMA interrupts +using NAPI for receive operations. Interrupt mitigation is not yet supported +for transmit operations, but will be added in a future maintenance release. + +4.4) Ethtool support +-------------------- +Ethtool is supported. Driver statistics and internal errors can be taken using: +ethtool -S ethX command. It is possible to dump registers etc. + +4.5) PHY Support +---------------- +The driver is compatible with PAL to work with PHY and GPHY devices. + +4.7) List of source files: +-------------------------- + - Kconfig + - Makefile + - altera_tse_main.c: main network device driver + - altera_tse_ethtool.c: ethtool support + - altera_tse.h: private driver structure and common definitions + - altera_msgdma.h: MSGDMA implementation function definitions + - altera_sgdma.h: SGDMA implementation function definitions + - altera_msgdma.c: MSGDMA implementation + - altera_sgdma.c: SGDMA implementation + - altera_sgdmahw.h: SGDMA register and descriptor definitions + - altera_msgdmahw.h: MSGDMA register and descriptor definitions + - altera_utils.c: Driver utility functions + - altera_utils.h: Driver utility function definitions + +5. Debug Information +==================== + +The driver exports debug information such as internal statistics, +debug information, MAC and DMA registers etc. + +A user may use the ethtool support to get statistics: +e.g. using: ethtool -S ethX (that shows the statistics counters) +or sees the MAC registers: e.g. using: ethtool -d ethX + +The developer can also use the "debug" module parameter to get +further debug information. + +6. Statistics Support +===================== + +The controller and driver support a mix of IEEE standard defined statistics, +RFC defined statistics, and driver or Altera defined statistics. The four +specifications containing the standard definitions for these statistics are +as follows: + + - IEEE 802.3-2012 - IEEE Standard for Ethernet. + - RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt. + - RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt. + - Altera Triple Speed Ethernet User Guide, found at http://www.altera.com + +The statistics supported by the TSE and the device driver are as follows: + +"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.2. This statistics is the count of frames that are successfully +transmitted. + +"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.5. This statistic is the count of frames that are successfully +received. This count does not include any error packets such as CRC errors, +length errors, or alignment errors. + +"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE +802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are +an integral number of bytes in length and do not pass the CRC test as the frame +is received. + +"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012, +Section 5.2.2.1.7. This statistic is the count of frames that are not an +integral number of bytes in length and do not pass the CRC test as the frame is +received. + +"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.8. This statistic is the count of data and pad bytes +successfully transmitted from the interface. + +"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.14. This statistic is the count of data and pad bytes +successfully received by the controller. + +"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE +802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames +transmitted from the network controller. + +"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE +802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames +received by the network controller. + +"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is +a count of the number of packets received containing errors that prevented the +packet from being delivered to a higher level protocol. + +"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic +is a count of the number of packets that could not be transmitted due to errors. + +"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were not addressed +to the broadcast address or a multicast group. + +"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were addressed to +a multicast address group. + +"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were addressed to +the broadcast address. + +"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This +statistic is the number of outbound packets not transmitted even though an +error was not detected. An example of a reason this might occur is to free up +internal buffer space. + +"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were not addressed to +a multicast group or broadcast address. + +"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were addressed to a +multicast group. + +"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were addressed to a +broadcast address. + +"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819. +This statistic counts the number of packets dropped due to lack of internal +controller resources. + +"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819. +This statistic counts the total number of bytes received by the controller, +including error and discarded packets. + +"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819. +This statistic counts the total number of packets received by the controller, +including error, discarded, unicast, multicast, and broadcast packets. + +"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819. +This statistic counts the number of correctly formed packets received less +than 64 bytes long. + +"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819. +This statistic counts the number of correctly formed packets greater than 1518 +bytes long. + +"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819. +This statistic counts the total number of packets received that were 64 octets +in length. + +"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC +2819. This statistic counts the total number of packets received that were +between 65 and 127 octets in length inclusive. + +"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 128 and 255 octets in length inclusive. + +"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 256 and 511 octets in length inclusive. + +"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 512 and 1023 octets in length inclusive. + +"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets define +in RFC 2819. This statistic is the total number of packets received that were +between 1024 and 1518 octets in length inclusive. + +"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the +Altera TSE. This statistics counts the number of received good and errored +frames between the length of 1519 and the maximum frame length configured +in the frm_length register. See the Altera TSE User Guide for More details. + +"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This +statistic is the total number of packets received that were longer than 1518 +octets, and had either a bad CRC with an integral number of octets (CRC Error) +or a bad CRC with a non-integral number of octets (Alignment Error). + +"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This +statistic is the total number of packets received that were less than 64 octets +in length and had either a bad CRC with an integral number of octets (CRC +error) or a bad CRC with a non-integral number of octets (Alignment Error). diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt deleted file mode 100644 index 50b8589d12fd..000000000000 --- a/Documentation/networking/altera_tse.txt +++ /dev/null @@ -1,263 +0,0 @@ - Altera Triple-Speed Ethernet MAC driver - -Copyright (C) 2008-2014 Altera Corporation - -This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers -using the SGDMA and MSGDMA soft DMA IP components. The driver uses the -platform bus to obtain component resources. The designs used to test this -driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board, -and tested with ARM and NIOS processor hosts separately. The anticipated use -cases are simple communications between an embedded system and an external peer -for status and simple configuration of the embedded system. - -For more information visit www.altera.com and www.rocketboards.org. Support -forums for the driver may be found on www.rocketboards.org, and a design used -to test this driver may be found there as well. Support is also available from -the maintainer of this driver, found in MAINTAINERS. - -The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP -components that can be assembled and built into an FPGA using the Altera -Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that -this driver was tested against. The sopc2dts tool is used to create the -device tree for the driver, and may be found at rocketboards.org. - -The driver probe function examines the device tree and determines if the -Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The -probe function then installs the appropriate set of DMA routines to -initialize, setup transmits, receives, and interrupt handling primitives for -the respective configurations. - -The SGDMA component is to be deprecated in the near future (over the next 1-2 -years as of this writing in early 2014) in favor of the MSGDMA component. -SGDMA support is included for existing designs and reference in case a -developer wishes to support their own soft DMA logic and driver support. Any -new designs should not use the SGDMA. - -The SGDMA supports only a single transmit or receive operation at a time, and -therefore will not perform as well compared to the MSGDMA soft IP. Please -visit www.altera.com for known, documented SGDMA errata. - -Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time. -Scatter-gather DMA will be added to a future maintenance update to this -driver. - -Jumbo frames are not supported at this time. - -The driver limits PHY operations to 10/100Mbps, and has not yet been fully -tested for 1Gbps. This support will be added in a future maintenance update. - -1) Kernel Configuration -The kernel configuration option is ALTERA_TSE: - Device Drivers ---> Network device support ---> Ethernet driver support ---> - Altera Triple-Speed Ethernet MAC support (ALTERA_TSE) - -2) Driver parameters list: - debug: message level (0: no output, 16: all); - dma_rx_num: Number of descriptors in the RX list (default is 64); - dma_tx_num: Number of descriptors in the TX list (default is 64). - -3) Command line options -Driver parameters can be also passed in command line by using: - altera_tse=dma_rx_num:128,dma_tx_num:512 - -4) Driver information and notes - -4.1) Transmit process -When the driver's transmit routine is called by the kernel, it sets up a -transmit descriptor by calling the underlying DMA transmit routine (SGDMA or -MSGDMA), and initiates a transmit operation. Once the transmit is complete, an -interrupt is driven by the transmit DMA logic. The driver handles the transmit -completion in the context of the interrupt handling chain by recycling -resource required to send and track the requested transmit operation. - -4.2) Receive process -The driver will post receive buffers to the receive DMA logic during driver -initialization. Receive buffers may or may not be queued depending upon the -underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able -to queue receive buffers to the SGDMA receive logic). When a packet is -received, the DMA logic generates an interrupt. The driver handles a receive -interrupt by obtaining the DMA receive logic status, reaping receive -completions until no more receive completions are available. - -4.3) Interrupt Mitigation -The driver is able to mitigate the number of its DMA interrupts -using NAPI for receive operations. Interrupt mitigation is not yet supported -for transmit operations, but will be added in a future maintenance release. - -4.4) Ethtool support -Ethtool is supported. Driver statistics and internal errors can be taken using: -ethtool -S ethX command. It is possible to dump registers etc. - -4.5) PHY Support -The driver is compatible with PAL to work with PHY and GPHY devices. - -4.7) List of source files: - o Kconfig - o Makefile - o altera_tse_main.c: main network device driver - o altera_tse_ethtool.c: ethtool support - o altera_tse.h: private driver structure and common definitions - o altera_msgdma.h: MSGDMA implementation function definitions - o altera_sgdma.h: SGDMA implementation function definitions - o altera_msgdma.c: MSGDMA implementation - o altera_sgdma.c: SGDMA implementation - o altera_sgdmahw.h: SGDMA register and descriptor definitions - o altera_msgdmahw.h: MSGDMA register and descriptor definitions - o altera_utils.c: Driver utility functions - o altera_utils.h: Driver utility function definitions - -5) Debug Information - -The driver exports debug information such as internal statistics, -debug information, MAC and DMA registers etc. - -A user may use the ethtool support to get statistics: -e.g. using: ethtool -S ethX (that shows the statistics counters) -or sees the MAC registers: e.g. using: ethtool -d ethX - -The developer can also use the "debug" module parameter to get -further debug information. - -6) Statistics Support - -The controller and driver support a mix of IEEE standard defined statistics, -RFC defined statistics, and driver or Altera defined statistics. The four -specifications containing the standard definitions for these statistics are -as follows: - - o IEEE 802.3-2012 - IEEE Standard for Ethernet. - o RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt. - o RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt. - o Altera Triple Speed Ethernet User Guide, found at http://www.altera.com - -The statistics supported by the TSE and the device driver are as follows: - -"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012, -Section 5.2.2.1.2. This statistics is the count of frames that are successfully -transmitted. - -"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012, -Section 5.2.2.1.5. This statistic is the count of frames that are successfully -received. This count does not include any error packets such as CRC errors, -length errors, or alignment errors. - -"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE -802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are -an integral number of bytes in length and do not pass the CRC test as the frame -is received. - -"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012, -Section 5.2.2.1.7. This statistic is the count of frames that are not an -integral number of bytes in length and do not pass the CRC test as the frame is -received. - -"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012, -Section 5.2.2.1.8. This statistic is the count of data and pad bytes -successfully transmitted from the interface. - -"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012, -Section 5.2.2.1.14. This statistic is the count of data and pad bytes -successfully received by the controller. - -"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE -802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames -transmitted from the network controller. - -"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE -802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames -received by the network controller. - -"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is -a count of the number of packets received containing errors that prevented the -packet from being delivered to a higher level protocol. - -"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic -is a count of the number of packets that could not be transmitted due to errors. - -"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This -statistic is a count of the number of packets received that were not addressed -to the broadcast address or a multicast group. - -"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This -statistic is a count of the number of packets received that were addressed to -a multicast address group. - -"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This -statistic is a count of the number of packets received that were addressed to -the broadcast address. - -"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This -statistic is the number of outbound packets not transmitted even though an -error was not detected. An example of a reason this might occur is to free up -internal buffer space. - -"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This -statistic counts the number of packets transmitted that were not addressed to -a multicast group or broadcast address. - -"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This -statistic counts the number of packets transmitted that were addressed to a -multicast group. - -"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This -statistic counts the number of packets transmitted that were addressed to a -broadcast address. - -"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819. -This statistic counts the number of packets dropped due to lack of internal -controller resources. - -"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819. -This statistic counts the total number of bytes received by the controller, -including error and discarded packets. - -"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819. -This statistic counts the total number of packets received by the controller, -including error, discarded, unicast, multicast, and broadcast packets. - -"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819. -This statistic counts the number of correctly formed packets received less -than 64 bytes long. - -"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819. -This statistic counts the number of correctly formed packets greater than 1518 -bytes long. - -"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819. -This statistic counts the total number of packets received that were 64 octets -in length. - -"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC -2819. This statistic counts the total number of packets received that were -between 65 and 127 octets in length inclusive. - -"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in -RFC 2819. This statistic is the total number of packets received that were -between 128 and 255 octets in length inclusive. - -"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in -RFC 2819. This statistic is the total number of packets received that were -between 256 and 511 octets in length inclusive. - -"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in -RFC 2819. This statistic is the total number of packets received that were -between 512 and 1023 octets in length inclusive. - -"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets define -in RFC 2819. This statistic is the total number of packets received that were -between 1024 and 1518 octets in length inclusive. - -"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the -Altera TSE. This statistics counts the number of received good and errored -frames between the length of 1519 and the maximum frame length configured -in the frm_length register. See the Altera TSE User Guide for More details. - -"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This -statistic is the total number of packets received that were longer than 1518 -octets, and had either a bad CRC with an integral number of octets (CRC Error) -or a bad CRC with a non-integral number of octets (Alignment Error). - -"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This -statistic is the total number of packets received that were less than 64 octets -in length and had either a bad CRC with an integral number of octets (CRC -error) or a bad CRC with a non-integral number of octets (Alignment Error). diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index dc37fc8d5bee..96ffad845fd9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -38,6 +38,7 @@ Contents: nfc 6lowpan 6pack + altera_tse .. only:: subproject and html -- cgit v1.2.3 From aa92320b3e38f2b64b2d91a20761db1683e6c531 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:19 +0200 Subject: docs: networking: convert arcnet-hardware.txt to ReST - add SPDX header; - add document title markup; - add notes markups; - mark tables as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/arcnet-hardware.rst | 3234 ++++++++++++++++++++++++++ Documentation/networking/arcnet-hardware.txt | 3133 ------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 3235 insertions(+), 3133 deletions(-) create mode 100644 Documentation/networking/arcnet-hardware.rst delete mode 100644 Documentation/networking/arcnet-hardware.txt (limited to 'Documentation') diff --git a/Documentation/networking/arcnet-hardware.rst b/Documentation/networking/arcnet-hardware.rst new file mode 100644 index 000000000000..b5a1a020c824 --- /dev/null +++ b/Documentation/networking/arcnet-hardware.rst @@ -0,0 +1,3234 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +ARCnet Hardware +=============== + +.. note:: + + 1) This file is a supplement to arcnet.txt. Please read that for general + driver configuration help. + 2) This file is no longer Linux-specific. It should probably be moved out + of the kernel sources. Ideas? + +Because so many people (myself included) seem to have obtained ARCnet cards +without manuals, this file contains a quick introduction to ARCnet hardware, +some cabling tips, and a listing of all jumper settings I can find. Please +e-mail apenwarr@worldvisions.ca with any settings for your particular card, +or any other information you have! + + +Introduction to ARCnet +====================== + +ARCnet is a network type which works in a way similar to popular Ethernet +networks but which is also different in some very important ways. + +First of all, you can get ARCnet cards in at least two speeds: 2.5 Mbps +(slower than Ethernet) and 100 Mbps (faster than normal Ethernet). In fact, +there are others as well, but these are less common. The different hardware +types, as far as I'm aware, are not compatible and so you cannot wire a +100 Mbps card to a 2.5 Mbps card, and so on. From what I hear, my driver does +work with 100 Mbps cards, but I haven't been able to verify this myself, +since I only have the 2.5 Mbps variety. It is probably not going to saturate +your 100 Mbps card. Stop complaining. :) + +You also cannot connect an ARCnet card to any kind of Ethernet card and +expect it to work. + +There are two "types" of ARCnet - STAR topology and BUS topology. This +refers to how the cards are meant to be wired together. According to most +available documentation, you can only connect STAR cards to STAR cards and +BUS cards to BUS cards. That makes sense, right? Well, it's not quite +true; see below under "Cabling." + +Once you get past these little stumbling blocks, ARCnet is actually quite a +well-designed standard. It uses something called "modified token passing" +which makes it completely incompatible with so-called "Token Ring" cards, +but which makes transfers much more reliable than Ethernet does. In fact, +ARCnet will guarantee that a packet arrives safely at the destination, and +even if it can't possibly be delivered properly (ie. because of a cable +break, or because the destination computer does not exist) it will at least +tell the sender about it. + +Because of the carefully defined action of the "token", it will always make +a pass around the "ring" within a maximum length of time. This makes it +useful for realtime networks. + +In addition, all known ARCnet cards have an (almost) identical programming +interface. This means that with one ARCnet driver you can support any +card, whereas with Ethernet each manufacturer uses what is sometimes a +completely different programming interface, leading to a lot of different, +sometimes very similar, Ethernet drivers. Of course, always using the same +programming interface also means that when high-performance hardware +facilities like PCI bus mastering DMA appear, it's hard to take advantage of +them. Let's not go into that. + +One thing that makes ARCnet cards difficult to program for, however, is the +limit on their packet sizes; standard ARCnet can only send packets that are +up to 508 bytes in length. This is smaller than the Internet "bare minimum" +of 576 bytes, let alone the Ethernet MTU of 1500. To compensate, an extra +level of encapsulation is defined by RFC1201, which I call "packet +splitting," that allows "virtual packets" to grow as large as 64K each, +although they are generally kept down to the Ethernet-style 1500 bytes. + +For more information on the advantages and disadvantages (mostly the +advantages) of ARCnet networks, you might try the "ARCnet Trade Association" +WWW page: + + http://www.arcnet.com + + +Cabling ARCnet Networks +======================= + +This section was rewritten by + + Vojtech Pavlik + +using information from several people, including: + + - Avery Pennraun + - Stephen A. Wood + - John Paul Morrison + - Joachim Koenig + +and Avery touched it up a bit, at Vojtech's request. + +ARCnet (the classic 2.5 Mbps version) can be connected by two different +types of cabling: coax and twisted pair. The other ARCnet-type networks +(100 Mbps TCNS and 320 kbps - 32 Mbps ARCnet Plus) use different types of +cabling (Type1, Fiber, C1, C4, C5). + +For a coax network, you "should" use 93 Ohm RG-62 cable. But other cables +also work fine, because ARCnet is a very stable network. I personally use 75 +Ohm TV antenna cable. + +Cards for coax cabling are shipped in two different variants: for BUS and +STAR network topologies. They are mostly the same. The only difference +lies in the hybrid chip installed. BUS cards use high impedance output, +while STAR use low impedance. Low impedance card (STAR) is electrically +equal to a high impedance one with a terminator installed. + +Usually, the ARCnet networks are built up from STAR cards and hubs. There +are two types of hubs - active and passive. Passive hubs are small boxes +with four BNC connectors containing four 47 Ohm resistors:: + + | | wires + R + junction + -R-+-R- R 47 Ohm resistors + R + | + +The shielding is connected together. Active hubs are much more complicated; +they are powered and contain electronics to amplify the signal and send it +to other segments of the net. They usually have eight connectors. Active +hubs come in two variants - dumb and smart. The dumb variant just +amplifies, but the smart one decodes to digital and encodes back all packets +coming through. This is much better if you have several hubs in the net, +since many dumb active hubs may worsen the signal quality. + +And now to the cabling. What you can connect together: + +1. A card to a card. This is the simplest way of creating a 2-computer + network. + +2. A card to a passive hub. Remember that all unused connectors on the hub + must be properly terminated with 93 Ohm (or something else if you don't + have the right ones) terminators. + + (Avery's note: oops, I didn't know that. Mine (TV cable) works + anyway, though.) + +3. A card to an active hub. Here is no need to terminate the unused + connectors except some kind of aesthetic feeling. But, there may not be + more than eleven active hubs between any two computers. That of course + doesn't limit the number of active hubs on the network. + +4. An active hub to another. + +5. An active hub to passive hub. + +Remember that you cannot connect two passive hubs together. The power loss +implied by such a connection is too high for the net to operate reliably. + +An example of a typical ARCnet network:: + + R S - STAR type card + S------H--------A-------S R - Terminator + | | H - Hub + | | A - Active hub + | S----H----S + S | + | + S + +The BUS topology is very similar to the one used by Ethernet. The only +difference is in cable and terminators: they should be 93 Ohm. Ethernet +uses 50 Ohm impedance. You use T connectors to put the computers on a single +line of cable, the bus. You have to put terminators at both ends of the +cable. A typical BUS ARCnet network looks like:: + + RT----T------T------T------T------TR + B B B B B B + + B - BUS type card + R - Terminator + T - T connector + +But that is not all! The two types can be connected together. According to +the official documentation the only way of connecting them is using an active +hub:: + + A------T------T------TR + | B B B + S---H---S + | + S + +The official docs also state that you can use STAR cards at the ends of +BUS network in place of a BUS card and a terminator:: + + S------T------T------S + B B + +But, according to my own experiments, you can simply hang a BUS type card +anywhere in middle of a cable in a STAR topology network. And more - you +can use the bus card in place of any star card if you use a terminator. Then +you can build very complicated networks fulfilling all your needs! An +example:: + + S + | + RT------T-------T------H------S + B B B | + | R + S------A------T-------T-------A-------H------TR + | B B | | B + | S BT | + | | | S----A-----S + S------H---A----S | | + | | S------T----H---S | + S S B R S + +A basically different cabling scheme is used with Twisted Pair cabling. Each +of the TP cards has two RJ (phone-cord style) connectors. The cards are +then daisy-chained together using a cable connecting every two neighboring +cards. The ends are terminated with RJ 93 Ohm terminators which plug into +the empty connectors of cards on the ends of the chain. An example:: + + ___________ ___________ + _R_|_ _|_|_ _|_R_ + | | | | | | + |Card | |Card | |Card | + |_____| |_____| |_____| + + +There are also hubs for the TP topology. There is nothing difficult +involved in using them; you just connect a TP chain to a hub on any end or +even at both. This way you can create almost any network configuration. +The maximum of 11 hubs between any two computers on the net applies here as +well. An example:: + + RP-------P--------P--------H-----P------P-----PR + | + RP-----H--------P--------H-----P------PR + | | + PR PR + + R - RJ Terminator + P - TP Card + H - TP Hub + +Like any network, ARCnet has a limited cable length. These are the maximum +cable lengths between two active ends (an active end being an active hub or +a STAR card). + + ========== ======= =========== + RG-62 93 Ohm up to 650 m + RG-59/U 75 Ohm up to 457 m + RG-11/U 75 Ohm up to 533 m + IBM Type 1 150 Ohm up to 200 m + IBM Type 3 100 Ohm up to 100 m + ========== ======= =========== + +The maximum length of all cables connected to a passive hub is limited to 65 +meters for RG-62 cabling; less for others. You can see that using passive +hubs in a large network is a bad idea. The maximum length of a single "BUS +Trunk" is about 300 meters for RG-62. The maximum distance between the two +most distant points of the net is limited to 3000 meters. The maximum length +of a TP cable between two cards/hubs is 650 meters. + + +Setting the Jumpers +=================== + +All ARCnet cards should have a total of four or five different settings: + + - the I/O address: this is the "port" your ARCnet card is on. Probed + values in the Linux ARCnet driver are only from 0x200 through 0x3F0. (If + your card has additional ones, which is possible, please tell me.) This + should not be the same as any other device on your system. According to + a doc I got from Novell, MS Windows prefers values of 0x300 or more, + eating net connections on my system (at least) otherwise. My guess is + this may be because, if your card is at 0x2E0, probing for a serial port + at 0x2E8 will reset the card and probably mess things up royally. + + - Avery's favourite: 0x300. + + - the IRQ: on 8-bit cards, it might be 2 (9), 3, 4, 5, or 7. + on 16-bit cards, it might be 2 (9), 3, 4, 5, 7, or 10-15. + + Make sure this is different from any other card on your system. Note + that IRQ2 is the same as IRQ9, as far as Linux is concerned. You can + "cat /proc/interrupts" for a somewhat complete list of which ones are in + use at any given time. Here is a list of common usages from Vojtech + Pavlik : + + ("Not on bus" means there is no way for a card to generate this + interrupt) + + ====== ========================================================= + IRQ 0 Timer 0 (Not on bus) + IRQ 1 Keyboard (Not on bus) + IRQ 2 IRQ Controller 2 (Not on bus, nor does interrupt the CPU) + IRQ 3 COM2 + IRQ 4 COM1 + IRQ 5 FREE (LPT2 if you have it; sometimes COM3; maybe PLIP) + IRQ 6 Floppy disk controller + IRQ 7 FREE (LPT1 if you don't use the polling driver; PLIP) + IRQ 8 Realtime Clock Interrupt (Not on bus) + IRQ 9 FREE (VGA vertical sync interrupt if enabled) + IRQ 10 FREE + IRQ 11 FREE + IRQ 12 FREE + IRQ 13 Numeric Coprocessor (Not on bus) + IRQ 14 Fixed Disk Controller + IRQ 15 FREE (Fixed Disk Controller 2 if you have it) + ====== ========================================================= + + + .. note:: + + IRQ 9 is used on some video cards for the "vertical retrace" + interrupt. This interrupt would have been handy for things like + video games, as it occurs exactly once per screen refresh, but + unfortunately IBM cancelled this feature starting with the original + VGA and thus many VGA/SVGA cards do not support it. For this + reason, no modern software uses this interrupt and it can almost + always be safely disabled, if your video card supports it at all. + + If your card for some reason CANNOT disable this IRQ (usually there + is a jumper), one solution would be to clip the printed circuit + contact on the board: it's the fourth contact from the left on the + back side. I take no responsibility if you try this. + + - Avery's favourite: IRQ2 (actually IRQ9). Watch that VGA, though. + + - the memory address: Unlike most cards, ARCnets use "shared memory" for + copying buffers around. Make SURE it doesn't conflict with any other + used memory in your system! + + :: + + A0000 - VGA graphics memory (ok if you don't have VGA) + B0000 - Monochrome text mode + C0000 \ One of these is your VGA BIOS - usually C0000. + E0000 / + F0000 - System BIOS + + Anything less than 0xA0000 is, well, a BAD idea since it isn't above + 640k. + + - Avery's favourite: 0xD0000 + + - the station address: Every ARCnet card has its own "unique" network + address from 0 to 255. Unlike Ethernet, you can set this address + yourself with a jumper or switch (or on some cards, with special + software). Since it's only 8 bits, you can only have 254 ARCnet cards + on a network. DON'T use 0 or 255, since these are reserved (although + neat stuff will probably happen if you DO use them). By the way, if you + haven't already guessed, don't set this the same as any other ARCnet on + your network! + + - Avery's favourite: 3 and 4. Not that it matters. + + - There may be ETS1 and ETS2 settings. These may or may not make a + difference on your card (many manuals call them "reserved"), but are + used to change the delays used when powering up a computer on the + network. This is only necessary when wiring VERY long range ARCnet + networks, on the order of 4km or so; in any case, the only real + requirement here is that all cards on the network with ETS1 and ETS2 + jumpers have them in the same position. Chris Hindy + sent in a chart with actual values for this: + + ======= ======= =============== ==================== + ET1 ET2 Response Time Reconfiguration Time + ======= ======= =============== ==================== + open open 74.7us 840us + open closed 283.4us 1680us + closed open 561.8us 1680us + closed closed 1118.6us 1680us + ======= ======= =============== ==================== + + Make sure you set ETS1 and ETS2 to the SAME VALUE for all cards on your + network. + +Also, on many cards (not mine, though) there are red and green LED's. +Vojtech Pavlik tells me this is what they mean: + + =============== =============== ===================================== + GREEN RED Status + =============== =============== ===================================== + OFF OFF Power off + OFF Short flashes Cabling problems (broken cable or not + terminated) + OFF (short) ON Card init + ON ON Normal state - everything OK, nothing + happens + ON Long flashes Data transfer + ON OFF Never happens (maybe when wrong ID) + =============== =============== ===================================== + + +The following is all the specific information people have sent me about +their own particular ARCnet cards. It is officially a mess, and contains +huge amounts of duplicated information. I have no time to fix it. If you +want to, PLEASE DO! Just send me a 'diff -u' of all your changes. + +The model # is listed right above specifics for that card, so you should be +able to use your text viewer's "search" function to find the entry you want. +If you don't KNOW what kind of card you have, try looking through the +various diagrams to see if you can tell. + +If your model isn't listed and/or has different settings, PLEASE PLEASE +tell me. I had to figure mine out without the manual, and it WASN'T FUN! + +Even if your ARCnet model isn't listed, but has the same jumpers as another +model that is, please e-mail me to say so. + +Cards Listed in this file (in this order, mostly): + + =============== ======================= ==== + Manufacturer Model # Bits + =============== ======================= ==== + SMC PC100 8 + SMC PC110 8 + SMC PC120 8 + SMC PC130 8 + SMC PC270E 8 + SMC PC500 16 + SMC PC500Longboard 16 + SMC PC550Longboard 16 + SMC PC600 16 + SMC PC710 8 + SMC? LCS-8830(-T) 8/16 + Puredata PDI507 8 + CNet Tech CN120-Series 8 + CNet Tech CN160-Series 16 + Lantech? UM9065L chipset 8 + Acer 5210-003 8 + Datapoint? LAN-ARC-8 8 + Topware TA-ARC/10 8 + Thomas-Conrad 500-6242-0097 REV A 8 + Waterloo? (C)1985 Waterloo Micro. 8 + No Name -- 8/16 + No Name Taiwan R.O.C? 8 + No Name Model 9058 8 + Tiara Tiara Lancard? 8 + =============== ======================= ==== + + +* SMC = Standard Microsystems Corp. +* CNet Tech = CNet Technology, Inc. + +Unclassified Stuff +================== + + - Please send any other information you can find. + + - And some other stuff (more info is welcome!):: + + From: root@ultraworld.xs4all.nl (Timo Hilbrink) + To: apenwarr@foxnet.net (Avery Pennarun) + Date: Wed, 26 Oct 1994 02:10:32 +0000 (GMT) + Reply-To: timoh@xs4all.nl + + [...parts deleted...] + + About the jumpers: On my PC130 there is one more jumper, located near the + cable-connector and it's for changing to star or bus topology; + closed: star - open: bus + On the PC500 are some more jumper-pins, one block labeled with RX,PDN,TXI + and another with ALE,LA17,LA18,LA19 these are undocumented.. + + [...more parts deleted...] + + --- CUT --- + +Standard Microsystems Corp (SMC) +================================ + +PC100, PC110, PC120, PC130 (8-bit cards) and PC500, PC600 (16-bit cards) +------------------------------------------------------------------------ + + - mainly from Avery Pennarun . Values depicted + are from Avery's setup. + - special thanks to Timo Hilbrink for noting that PC120, + 130, 500, and 600 all have the same switches as Avery's PC100. + PC500/600 have several extra, undocumented pins though. (?) + - PC110 settings were verified by Stephen A. Wood + - Also, the JP- and S-numbers probably don't match your card exactly. Try + to find jumpers/switches with the same number of settings - it's + probably more reliable. + +:: + + JP5 [|] : : : : + (IRQ Setting) IRQ2 IRQ3 IRQ4 IRQ5 IRQ7 + Put exactly one jumper on exactly one set of pins. + + + 1 2 3 4 5 6 7 8 9 10 + S1 /----------------------------------\ + (I/O and Memory | 1 1 * 0 0 0 0 * 1 1 0 1 | + addresses) \----------------------------------/ + |--| |--------| |--------| + (a) (b) (m) + + WARNING. It's very important when setting these which way + you're holding the card, and which way you think is '1'! + + If you suspect that your settings are not being made + correctly, try reversing the direction or inverting the + switch positions. + + a: The first digit of the I/O address. + Setting Value + ------- ----- + 00 0 + 01 1 + 10 2 + 11 3 + + b: The second digit of the I/O address. + Setting Value + ------- ----- + 0000 0 + 0001 1 + 0010 2 + ... ... + 1110 E + 1111 F + + The I/O address is in the form ab0. For example, if + a is 0x2 and b is 0xE, the address will be 0x2E0. + + DO NOT SET THIS LESS THAN 0x200!!!!! + + + m: The first digit of the memory address. + Setting Value + ------- ----- + 0000 0 + 0001 1 + 0010 2 + ... ... + 1110 E + 1111 F + + The memory address is in the form m0000. For example, if + m is D, the address will be 0xD0000. + + DO NOT SET THIS TO C0000, F0000, OR LESS THAN A0000! + + 1 2 3 4 5 6 7 8 + S2 /--------------------------\ + (Station Address) | 1 1 0 0 0 0 0 0 | + \--------------------------/ + + Setting Value + ------- ----- + 00000000 00 + 10000000 01 + 01000000 02 + ... + 01111111 FE + 11111111 FF + + Note that this is binary with the digits reversed! + + DO NOT SET THIS TO 0 OR 255 (0xFF)! + + +PC130E/PC270E (8-bit cards) +--------------------------- + + - from Juergen Seifert + +This description has been written by Juergen Seifert +using information from the following Original SMC Manual + + "Configuration Guide for ARCNET(R)-PC130E/PC270 Network + Controller Boards Pub. # 900.044A June, 1989" + +ARCNET is a registered trademark of the Datapoint Corporation +SMC is a registered trademark of the Standard Microsystems Corporation + +The PC130E is an enhanced version of the PC130 board, is equipped with a +standard BNC female connector for connection to RG-62/U coax cable. +Since this board is designed both for point-to-point connection in star +networks and for connection to bus networks, it is downwardly compatible +with all the other standard boards designed for coax networks (that is, +the PC120, PC110 and PC100 star topology boards and the PC220, PC210 and +PC200 bus topology boards). + +The PC270E is an enhanced version of the PC260 board, is equipped with two +modular RJ11-type jacks for connection to twisted pair wiring. +It can be used in a star or a daisy-chained network. + +:: + + 8 7 6 5 4 3 2 1 + ________________________________________________________________ + | | S1 | | + | |_________________| | + | Offs|Base |I/O Addr | + | RAM Addr | ___| + | ___ ___ CR3 |___| + | | \/ | CR4 |___| + | | PROM | ___| + | | | N | | 8 + | | SOCKET | o | | 7 + | |________| d | | 6 + | ___________________ e | | 5 + | | | A | S | 4 + | |oo| EXT2 | | d | 2 | 3 + | |oo| EXT1 | SMC | d | | 2 + | |oo| ROM | 90C63 | r |___| 1 + | |oo| IRQ7 | | |o| _____| + | |oo| IRQ5 | | |o| | J1 | + | |oo| IRQ4 | | STAR |_____| + | |oo| IRQ3 | | | J2 | + | |oo| IRQ2 |___________________| |_____| + |___ ______________| + | | + |_____________________________________________| + +Legend:: + + SMC 90C63 ARCNET Controller / Transceiver /Logic + S1 1-3: I/O Base Address Select + 4-6: Memory Base Address Select + 7-8: RAM Offset Select + S2 1-8: Node ID Select + EXT Extended Timeout Select + ROM ROM Enable Select + STAR Selected - Star Topology (PC130E only) + Deselected - Bus Topology (PC130E only) + CR3/CR4 Diagnostic LEDs + J1 BNC RG62/U Connector (PC130E only) + J1 6-position Telephone Jack (PC270E only) + J2 6-position Telephone Jack (PC270E only) + +Setting one of the switches to Off/Open means "1", On/Closed means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in group S2 are used to set the node ID. +These switches work in a way similar to the PC100-series cards; see that +entry for more information. + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The first three switches in switch group S1 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 1 2 3 | Address + -------|-------- + 0 0 0 | 260 + 0 0 1 | 290 + 0 1 0 | 2E0 (Manufacturer's default) + 0 1 1 | 2F0 + 1 0 0 | 300 + 1 0 1 | 350 + 1 1 0 | 380 + 1 1 1 | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 4-6 of switch group S1 select the Base of the 16K block. +Within that 16K address space, the buffer may be assigned any one of four +positions, determined by the offset, switches 7 and 8 of group S1. + +:: + + Switch | Hex RAM | Hex ROM + 4 5 6 7 8 | Address | Address *) + -----------|---------|----------- + 0 0 0 0 0 | C0000 | C2000 + 0 0 0 0 1 | C0800 | C2000 + 0 0 0 1 0 | C1000 | C2000 + 0 0 0 1 1 | C1800 | C2000 + | | + 0 0 1 0 0 | C4000 | C6000 + 0 0 1 0 1 | C4800 | C6000 + 0 0 1 1 0 | C5000 | C6000 + 0 0 1 1 1 | C5800 | C6000 + | | + 0 1 0 0 0 | CC000 | CE000 + 0 1 0 0 1 | CC800 | CE000 + 0 1 0 1 0 | CD000 | CE000 + 0 1 0 1 1 | CD800 | CE000 + | | + 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) + 0 1 1 0 1 | D0800 | D2000 + 0 1 1 1 0 | D1000 | D2000 + 0 1 1 1 1 | D1800 | D2000 + | | + 1 0 0 0 0 | D4000 | D6000 + 1 0 0 0 1 | D4800 | D6000 + 1 0 0 1 0 | D5000 | D6000 + 1 0 0 1 1 | D5800 | D6000 + | | + 1 0 1 0 0 | D8000 | DA000 + 1 0 1 0 1 | D8800 | DA000 + 1 0 1 1 0 | D9000 | DA000 + 1 0 1 1 1 | D9800 | DA000 + | | + 1 1 0 0 0 | DC000 | DE000 + 1 1 0 0 1 | DC800 | DE000 + 1 1 0 1 0 | DD000 | DE000 + 1 1 0 1 1 | DD800 | DE000 + | | + 1 1 1 0 0 | E0000 | E2000 + 1 1 1 0 1 | E0800 | E2000 + 1 1 1 1 0 | E1000 | E2000 + 1 1 1 1 1 | E1800 | E2000 + + *) To enable the 8K Boot PROM install the jumper ROM. + The default is jumper ROM not installed. + + +Setting the Timeouts and Interrupt +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The jumpers labeled EXT1 and EXT2 are used to determine the timeout +parameters. These two jumpers are normally left open. + +To select a hardware interrupt level set one (only one!) of the jumpers +IRQ2, IRQ3, IRQ4, IRQ5, IRQ7. The Manufacturer's default is IRQ2. + + +Configuring the PC130E for Star or Bus Topology +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The single jumper labeled STAR is used to configure the PC130E board for +star or bus topology. +When the jumper is installed, the board may be used in a star network, when +it is removed, the board can be used in a bus topology. + + +Diagnostic LEDs +^^^^^^^^^^^^^^^ + +Two diagnostic LEDs are visible on the rear bracket of the board. +The green LED monitors the network activity: the red one shows the +board activity:: + + Green | Status Red | Status + -------|------------------- ---------|------------------- + on | normal activity flash/on | data transfer + blink | reconfiguration off | no data transfer; + off | defective board or | incorrect memory or + | node ID is zero | I/O address + + +PC500/PC550 Longboard (16-bit cards) +------------------------------------ + + - from Juergen Seifert + + + .. note:: + + There is another Version of the PC500 called Short Version, which + is different in hard- and software! The most important differences + are: + + - The long board has no Shared memory. + - On the long board the selection of the interrupt is done by binary + coded switch, on the short board directly by jumper. + +[Avery's note: pay special attention to that: the long board HAS NO SHARED +MEMORY. This means the current Linux-ARCnet driver can't use these cards. +I have obtained a PC500Longboard and will be doing some experiments on it in +the future, but don't hold your breath. Thanks again to Juergen Seifert for +his advice about this!] + +This description has been written by Juergen Seifert +using information from the following Original SMC Manual + + "Configuration Guide for SMC ARCNET-PC500/PC550 + Series Network Controller Boards Pub. # 900.033 Rev. A + November, 1989" + +ARCNET is a registered trademark of the Datapoint Corporation +SMC is a registered trademark of the Standard Microsystems Corporation + +The PC500 is equipped with a standard BNC female connector for connection +to RG-62/U coax cable. +The board is designed both for point-to-point connection in star networks +and for connection to bus networks. + +The PC550 is equipped with two modular RJ11-type jacks for connection +to twisted pair wiring. +It can be used in a star or a daisy-chained (BUS) network. + +:: + + 1 + 0 9 8 7 6 5 4 3 2 1 6 5 4 3 2 1 + ____________________________________________________________________ + < | SW1 | | SW2 | | + > |_____________________| |_____________| | + < IRQ |I/O Addr | + > ___| + < CR4 |___| + > CR3 |___| + < ___| + > N | | 8 + < o | | 7 + > d | S | 6 + < e | W | 5 + > A | 3 | 4 + < d | | 3 + > d | | 2 + < r |___| 1 + > |o| _____| + < |o| | J1 | + > 3 1 JP6 |_____| + < |o|o| JP2 | J2 | + > |o|o| |_____| + < 4 2__ ______________| + > | | | + <____| |_____________________________________________| + +Legend:: + + SW1 1-6: I/O Base Address Select + 7-10: Interrupt Select + SW2 1-6: Reserved for Future Use + SW3 1-8: Node ID Select + JP2 1-4: Extended Timeout Select + JP6 Selected - Star Topology (PC500 only) + Deselected - Bus Topology (PC500 only) + CR3 Green Monitors Network Activity + CR4 Red Monitors Board Activity + J1 BNC RG62/U Connector (PC500 only) + J1 6-position Telephone Jack (PC550 only) + J2 6-position Telephone Jack (PC550 only) + +Setting one of the switches to Off/Open means "1", On/Closed means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in group SW3 are used to set the node ID. Each node +attached to the network must have an unique node ID which must be +different from 0. +Switch 1 serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + +Some Examples:: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The first six switches in switch group SW1 are used to select one +of 32 possible I/O Base addresses using the following table:: + + Switch | Hex I/O + 6 5 4 3 2 1 | Address + -------------|-------- + 0 1 0 0 0 0 | 200 + 0 1 0 0 0 1 | 210 + 0 1 0 0 1 0 | 220 + 0 1 0 0 1 1 | 230 + 0 1 0 1 0 0 | 240 + 0 1 0 1 0 1 | 250 + 0 1 0 1 1 0 | 260 + 0 1 0 1 1 1 | 270 + 0 1 1 0 0 0 | 280 + 0 1 1 0 0 1 | 290 + 0 1 1 0 1 0 | 2A0 + 0 1 1 0 1 1 | 2B0 + 0 1 1 1 0 0 | 2C0 + 0 1 1 1 0 1 | 2D0 + 0 1 1 1 1 0 | 2E0 (Manufacturer's default) + 0 1 1 1 1 1 | 2F0 + 1 1 0 0 0 0 | 300 + 1 1 0 0 0 1 | 310 + 1 1 0 0 1 0 | 320 + 1 1 0 0 1 1 | 330 + 1 1 0 1 0 0 | 340 + 1 1 0 1 0 1 | 350 + 1 1 0 1 1 0 | 360 + 1 1 0 1 1 1 | 370 + 1 1 1 0 0 0 | 380 + 1 1 1 0 0 1 | 390 + 1 1 1 0 1 0 | 3A0 + 1 1 1 0 1 1 | 3B0 + 1 1 1 1 0 0 | 3C0 + 1 1 1 1 0 1 | 3D0 + 1 1 1 1 1 0 | 3E0 + 1 1 1 1 1 1 | 3F0 + + +Setting the Interrupt +^^^^^^^^^^^^^^^^^^^^^ + +Switches seven through ten of switch group SW1 are used to select the +interrupt level. The interrupt level is binary coded, so selections +from 0 to 15 would be possible, but only the following eight values will +be supported: 3, 4, 5, 7, 9, 10, 11, 12. + +:: + + Switch | IRQ + 10 9 8 7 | + ---------|-------- + 0 0 1 1 | 3 + 0 1 0 0 | 4 + 0 1 0 1 | 5 + 0 1 1 1 | 7 + 1 0 0 1 | 9 (=2) (default) + 1 0 1 0 | 10 + 1 0 1 1 | 11 + 1 1 0 0 | 12 + + +Setting the Timeouts +^^^^^^^^^^^^^^^^^^^^ + +The two jumpers JP2 (1-4) are used to determine the timeout parameters. +These two jumpers are normally left open. +Refer to the COM9026 Data Sheet for alternate configurations. + + +Configuring the PC500 for Star or Bus Topology +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The single jumper labeled JP6 is used to configure the PC500 board for +star or bus topology. +When the jumper is installed, the board may be used in a star network, when +it is removed, the board can be used in a bus topology. + + +Diagnostic LEDs +^^^^^^^^^^^^^^^ + +Two diagnostic LEDs are visible on the rear bracket of the board. +The green LED monitors the network activity: the red one shows the +board activity:: + + Green | Status Red | Status + -------|------------------- ---------|------------------- + on | normal activity flash/on | data transfer + blink | reconfiguration off | no data transfer; + off | defective board or | incorrect memory or + | node ID is zero | I/O address + + +PC710 (8-bit card) +------------------ + + - from J.S. van Oosten + +Note: this data is gathered by experimenting and looking at info of other +cards. However, I'm sure I got 99% of the settings right. + +The SMC710 card resembles the PC270 card, but is much more basic (i.e. no +LEDs, RJ11 jacks, etc.) and 8 bit. Here's a little drawing:: + + _______________________________________ + | +---------+ +---------+ |____ + | | S2 | | S1 | | + | +---------+ +---------+ | + | | + | +===+ __ | + | | R | | | X-tal ###___ + | | O | |__| ####__'| + | | M | || ### + | +===+ | + | | + | .. JP1 +----------+ | + | .. | big chip | | + | .. | 90C63 | | + | .. | | | + | .. +----------+ | + ------- ----------- + ||||||||||||||||||||| + +The row of jumpers at JP1 actually consists of 8 jumpers, (sometimes +labelled) the same as on the PC270, from top to bottom: EXT2, EXT1, ROM, +IRQ7, IRQ5, IRQ4, IRQ3, IRQ2 (gee, wonder what they would do? :-) ) + +S1 and S2 perform the same function as on the PC270, only their numbers +are swapped (S1 is the nodeaddress, S2 sets IO- and RAM-address). + +I know it works when connected to a PC110 type ARCnet board. + + +***************************************************************************** + +Possibly SMC +============ + +LCS-8830(-T) (8 and 16-bit cards) +--------------------------------- + + - from Mathias Katzer + - Marek Michalkiewicz says the + LCS-8830 is slightly different from LCS-8830-T. These are 8 bit, BUS + only (the JP0 jumper is hardwired), and BNC only. + +This is a LCS-8830-T made by SMC, I think ('SMC' only appears on one PLCC, +nowhere else, not even on the few Xeroxed sheets from the manual). + +SMC ARCnet Board Type LCS-8830-T:: + + ------------------------------------ + | | + | JP3 88 8 JP2 | + | ##### | \ | + | ##### ET1 ET2 ###| + | 8 ###| + | U3 SW 1 JP0 ###| Phone Jacks + | -- ###| + | | | | + | | | SW2 | + | | | | + | | | ##### | + | -- ##### #### BNC Connector + | #### + | 888888 JP1 | + | 234567 | + -- ------- + ||||||||||||||||||||||||||| + -------------------------- + + + SW1: DIP-Switches for Station Address + SW2: DIP-Switches for Memory Base and I/O Base addresses + + JP0: If closed, internal termination on (default open) + JP1: IRQ Jumpers + JP2: Boot-ROM enabled if closed + JP3: Jumpers for response timeout + + U3: Boot-ROM Socket + + + ET1 ET2 Response Time Idle Time Reconfiguration Time + + 78 86 840 + X 285 316 1680 + X 563 624 1680 + X X 1130 1237 1680 + + (X means closed jumper) + + (DIP-Switch downwards means "0") + +The station address is binary-coded with SW1. + +The I/O base address is coded with DIP-Switches 6,7 and 8 of SW2: + +======== ======== +Switches Base +678 Address +======== ======== +000 260-26f +100 290-29f +010 2e0-2ef +110 2f0-2ff +001 300-30f +101 350-35f +011 380-38f +111 3e0-3ef +======== ======== + + +DIP Switches 1-5 of SW2 encode the RAM and ROM Address Range: + +======== ============= ================ +Switches RAM ROM +12345 Address Range Address Range +======== ============= ================ +00000 C:0000-C:07ff C:2000-C:3fff +10000 C:0800-C:0fff +01000 C:1000-C:17ff +11000 C:1800-C:1fff +00100 C:4000-C:47ff C:6000-C:7fff +10100 C:4800-C:4fff +01100 C:5000-C:57ff +11100 C:5800-C:5fff +00010 C:C000-C:C7ff C:E000-C:ffff +10010 C:C800-C:Cfff +01010 C:D000-C:D7ff +11010 C:D800-C:Dfff +00110 D:0000-D:07ff D:2000-D:3fff +10110 D:0800-D:0fff +01110 D:1000-D:17ff +11110 D:1800-D:1fff +00001 D:4000-D:47ff D:6000-D:7fff +10001 D:4800-D:4fff +01001 D:5000-D:57ff +11001 D:5800-D:5fff +00101 D:8000-D:87ff D:A000-D:bfff +10101 D:8800-D:8fff +01101 D:9000-D:97ff +11101 D:9800-D:9fff +00011 D:C000-D:c7ff D:E000-D:ffff +10011 D:C800-D:cfff +01011 D:D000-D:d7ff +11011 D:D800-D:dfff +00111 E:0000-E:07ff E:2000-E:3fff +10111 E:0800-E:0fff +01111 E:1000-E:17ff +11111 E:1800-E:1fff +======== ============= ================ + + +PureData Corp +============= + +PDI507 (8-bit card) +-------------------- + + - from Mark Rejhon (slight modifications by Avery) + - Avery's note: I think PDI508 cards (but definitely NOT PDI508Plus cards) + are mostly the same as this. PDI508Plus cards appear to be mainly + software-configured. + +Jumpers: + + There is a jumper array at the bottom of the card, near the edge + connector. This array is labelled J1. They control the IRQs and + something else. Put only one jumper on the IRQ pins. + + ETS1, ETS2 are for timing on very long distance networks. See the + more general information near the top of this file. + + There is a J2 jumper on two pins. A jumper should be put on them, + since it was already there when I got the card. I don't know what + this jumper is for though. + + There is a two-jumper array for J3. I don't know what it is for, + but there were already two jumpers on it when I got the card. It's + a six pin grid in a two-by-three fashion. The jumpers were + configured as follows:: + + .-------. + o | o o | + :-------: ------> Accessible end of card with connectors + o | o o | in this direction -------> + `-------' + +Carl de Billy explains J3 and J4: + + J3 Diagram:: + + .-------. + o | o o | + :-------: TWIST Technology + o | o o | + `-------' + .-------. + | o o | o + :-------: COAX Technology + | o o | o + `-------' + + - If using coax cable in a bus topology the J4 jumper must be removed; + place it on one pin. + + - If using bus topology with twisted pair wiring move the J3 + jumpers so they connect the middle pin and the pins closest to the RJ11 + Connectors. Also the J4 jumper must be removed; place it on one pin of + J4 jumper for storage. + + - If using star topology with twisted pair wiring move the J3 + jumpers so they connect the middle pin and the pins closest to the RJ11 + connectors. + + +DIP Switches: + + The DIP switches accessible on the accessible end of the card while + it is installed, is used to set the ARCnet address. There are 8 + switches. Use an address from 1 to 254 + + ========== ========================= + Switch No. ARCnet address + 12345678 + ========== ========================= + 00000000 FF (Don't use this!) + 00000001 FE + 00000010 FD + ... + 11111101 2 + 11111110 1 + 11111111 0 (Don't use this!) + ========== ========================= + + There is another array of eight DIP switches at the top of the + card. There are five labelled MS0-MS4 which seem to control the + memory address, and another three labelled IO0-IO2 which seem to + control the base I/O address of the card. + + This was difficult to test by trial and error, and the I/O addresses + are in a weird order. This was tested by setting the DIP switches, + rebooting the computer, and attempting to load ARCETHER at various + addresses (mostly between 0x200 and 0x400). The address that caused + the red transmit LED to blink, is the one that I thought works. + + Also, the address 0x3D0 seem to have a special meaning, since the + ARCETHER packet driver loaded fine, but without the red LED + blinking. I don't know what 0x3D0 is for though. I recommend using + an address of 0x300 since Windows may not like addresses below + 0x300. + + ============= =========== + IO Switch No. I/O address + 210 + ============= =========== + 111 0x260 + 110 0x290 + 101 0x2E0 + 100 0x2F0 + 011 0x300 + 010 0x350 + 001 0x380 + 000 0x3E0 + ============= =========== + + The memory switches set a reserved address space of 0x1000 bytes + (0x100 segment units, or 4k). For example if I set an address of + 0xD000, it will use up addresses 0xD000 to 0xD100. + + The memory switches were tested by booting using QEMM386 stealth, + and using LOADHI to see what address automatically became excluded + from the upper memory regions, and then attempting to load ARCETHER + using these addresses. + + I recommend using an ARCnet memory address of 0xD000, and putting + the EMS page frame at 0xC000 while using QEMM stealth mode. That + way, you get contiguous high memory from 0xD100 almost all the way + the end of the megabyte. + + Memory Switch 0 (MS0) didn't seem to work properly when set to OFF + on my card. It could be malfunctioning on my card. Experiment with + it ON first, and if it doesn't work, set it to OFF. (It may be a + modifier for the 0x200 bit?) + + ============= ============================================ + MS Switch No. + 43210 Memory address + ============= ============================================ + 00001 0xE100 (guessed - was not detected by QEMM) + 00011 0xE000 (guessed - was not detected by QEMM) + 00101 0xDD00 + 00111 0xDC00 + 01001 0xD900 + 01011 0xD800 + 01101 0xD500 + 01111 0xD400 + 10001 0xD100 + 10011 0xD000 + 10101 0xCD00 + 10111 0xCC00 + 11001 0xC900 (guessed - crashes tested system) + 11011 0xC800 (guessed - crashes tested system) + 11101 0xC500 (guessed - crashes tested system) + 11111 0xC400 (guessed - crashes tested system) + ============= ============================================ + +CNet Technology Inc. +==================== + +120 Series (8-bit cards) +------------------------ + - from Juergen Seifert + +This description has been written by Juergen Seifert +using information from the following Original CNet Manual + + "ARCNET USER'S MANUAL for + CN120A + CN120AB + CN120TP + CN120ST + CN120SBT + P/N:12-01-0007 + Revision 3.00" + +ARCNET is a registered trademark of the Datapoint Corporation + +- P/N 120A ARCNET 8 bit XT/AT Star +- P/N 120AB ARCNET 8 bit XT/AT Bus +- P/N 120TP ARCNET 8 bit XT/AT Twisted Pair +- P/N 120ST ARCNET 8 bit XT/AT Star, Twisted Pair +- P/N 120SBT ARCNET 8 bit XT/AT Star, Bus, Twisted Pair + +:: + + __________________________________________________________________ + | | + | ___| + | LED |___| + | ___| + | N | | ID7 + | o | | ID6 + | d | S | ID5 + | e | W | ID4 + | ___________________ A | 2 | ID3 + | | | d | | ID2 + | | | 1 2 3 4 5 6 7 8 d | | ID1 + | | | _________________ r |___| ID0 + | | 90C65 || SW1 | ____| + | JP 8 7 | ||_________________| | | + | |o|o| JP1 | | | J2 | + | |o|o| |oo| | | JP 1 1 1 | | + | ______________ | | 0 1 2 |____| + | | PROM | |___________________| |o|o|o| _____| + | > SOCKET | JP 6 5 4 3 2 |o|o|o| | J1 | + | |______________| |o|o|o|o|o| |o|o|o| |_____| + |_____ |o|o|o|o|o| ______________| + | | + |_____________________________________________| + +Legend:: + + 90C65 ARCNET Probe + S1 1-5: Base Memory Address Select + 6-8: Base I/O Address Select + S2 1-8: Node ID Select (ID0-ID7) + JP1 ROM Enable Select + JP2 IRQ2 + JP3 IRQ3 + JP4 IRQ4 + JP5 IRQ5 + JP6 IRQ7 + JP7/JP8 ET1, ET2 Timeout Parameters + JP10/JP11 Coax / Twisted Pair Select (CN120ST/SBT only) + JP12 Terminator Select (CN120AB/ST/SBT only) + J1 BNC RG62/U Connector (all except CN120TP) + J2 Two 6-position Telephone Jack (CN120TP/ST/SBT only) + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must be different from 0. +Switch 1 (ID0) serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are: + + ======= ====== ===== + Switch Label Value + ======= ====== ===== + 1 ID0 1 + 2 ID1 2 + 3 ID2 4 + 4 ID3 8 + 5 ID4 16 + 6 ID5 32 + 7 ID6 64 + 8 ID7 128 + ======= ====== ===== + +Some Examples:: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 + OFF ON ON | 290 + ON OFF ON | 2E0 (Manufacturer's default) + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 8K or memory base + 0x2000. +Switches 1-5 of switch block SW1 select the Memory Base address. + +:: + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + + *) To enable the Boot ROM install the jumper JP1 + +.. note:: + + Since the switches 1 and 2 are always set to ON it may be possible + that they can be used to add an offset of 2K, 4K or 6K to the base + address, but this feature is not documented in the manual and I + haven't tested it yet. + + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To select a hardware interrupt level install one (only one!) of the jumpers +JP2, JP3, JP4, JP5, JP6. JP2 is the default:: + + Jumper | IRQ + -------|----- + 2 | 2 + 3 | 3 + 4 | 4 + 5 | 5 + 6 | 7 + + +Setting the Internal Terminator on CN120AB/TP/SBT +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The jumper JP12 is used to enable the internal terminator:: + + ----- + 0 | 0 | + ----- ON | | ON + | 0 | | 0 | + | | OFF ----- OFF + | 0 | 0 + ----- + Terminator Terminator + disabled enabled + + +Selecting the Connector Type on CN120ST/SBT +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:: + + JP10 JP11 JP10 JP11 + ----- ----- + 0 0 | 0 | | 0 | + ----- ----- | | | | + | 0 | | 0 | | 0 | | 0 | + | | | | ----- ----- + | 0 | | 0 | 0 0 + ----- ----- + Coaxial Cable Twisted Pair Cable + (Default) + + +Setting the Timeout Parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The jumpers labeled EXT1 and EXT2 are used to determine the timeout +parameters. These two jumpers are normally left open. + + +CNet Technology Inc. +==================== + +160 Series (16-bit cards) +------------------------- + - from Juergen Seifert + +This description has been written by Juergen Seifert +using information from the following Original CNet Manual + + "ARCNET USER'S MANUAL for + CN160A CN160AB CN160TP + P/N:12-01-0006 Revision 3.00" + +ARCNET is a registered trademark of the Datapoint Corporation + +- P/N 160A ARCNET 16 bit XT/AT Star +- P/N 160AB ARCNET 16 bit XT/AT Bus +- P/N 160TP ARCNET 16 bit XT/AT Twisted Pair + +:: + + ___________________________________________________________________ + < _________________________ ___| + > |oo| JP2 | | LED |___| + < |oo| JP1 | 9026 | LED |___| + > |_________________________| ___| + < N | | ID7 + > 1 o | | ID6 + < 1 2 3 4 5 6 7 8 9 0 d | S | ID5 + > _______________ _____________________ e | W | ID4 + < | PROM | | SW1 | A | 2 | ID3 + > > SOCKET | |_____________________| d | | ID2 + < |_______________| | IO-Base | MEM | d | | ID1 + > r |___| ID0 + < ____| + > | | + < | J1 | + > | | + < |____| + > 1 1 1 1 | + < 3 4 5 6 7 JP 8 9 0 1 2 3 | + > |o|o|o|o|o| |o|o|o|o|o|o| | + < |o|o|o|o|o| __ |o|o|o|o|o|o| ___________| + > | | | + <____________| |_______________________________________| + +Legend:: + + 9026 ARCNET Probe + SW1 1-6: Base I/O Address Select + 7-10: Base Memory Address Select + SW2 1-8: Node ID Select (ID0-ID7) + JP1/JP2 ET1, ET2 Timeout Parameters + JP3-JP13 Interrupt Select + J1 BNC RG62/U Connector (CN160A/AB only) + J1 Two 6-position Telephone Jack (CN160TP only) + LED + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must be different from 0. +Switch 1 (ID0) serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Label | Value + -------|-------|------- + 1 | ID0 | 1 + 2 | ID1 | 2 + 3 | ID2 | 4 + 4 | ID3 | 8 + 5 | ID4 | 16 + 6 | ID5 | 32 + 7 | ID6 | 64 + 8 | ID7 | 128 + +Some Examples:: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The first six switches in switch block SW1 are used to select the I/O Base +address using the following table:: + + Switch | Hex I/O + 1 2 3 4 5 6 | Address + ------------------------|-------- + OFF ON ON OFF OFF ON | 260 + OFF ON OFF ON ON OFF | 290 + OFF ON OFF OFF OFF ON | 2E0 (Manufacturer's default) + OFF ON OFF OFF OFF OFF | 2F0 + OFF OFF ON ON ON ON | 300 + OFF OFF ON OFF ON OFF | 350 + OFF OFF OFF ON ON ON | 380 + OFF OFF OFF OFF OFF ON | 3E0 + +Note: Other IO-Base addresses seem to be selectable, but only the above + combinations are documented. + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The switches 7-10 of switch block SW1 are used to select the Memory +Base address of the RAM (2K) and the PROM:: + + Switch | Hex RAM | Hex ROM + 7 8 9 10 | Address | Address + ----------------|---------|----------- + OFF OFF ON ON | C0000 | C8000 + OFF OFF ON OFF | D0000 | D8000 (Default) + OFF OFF OFF ON | E0000 | E8000 + +.. note:: + + Other MEM-Base addresses seem to be selectable, but only the above + combinations are documented. + + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To select a hardware interrupt level install one (only one!) of the jumpers +JP3 through JP13 using the following table:: + + Jumper | IRQ + -------|----------------- + 3 | 14 + 4 | 15 + 5 | 12 + 6 | 11 + 7 | 10 + 8 | 3 + 9 | 4 + 10 | 5 + 11 | 6 + 12 | 7 + 13 | 2 (=9) Default! + +.. note:: + + - Do not use JP11=IRQ6, it may conflict with your Floppy Disk + Controller + - Use JP3=IRQ14 only, if you don't have an IDE-, MFM-, or RLL- + Hard Disk, it may conflict with their controllers + + +Setting the Timeout Parameters +------------------------------ + +The jumpers labeled JP1 and JP2 are used to determine the timeout +parameters. These two jumpers are normally left open. + + +Lantech +======= + +8-bit card, unknown model +------------------------- + - from Vlad Lungu - his e-mail address seemed broken at + the time I tried to reach him. Sorry Vlad, if you didn't get my reply. + +:: + + ________________________________________________________________ + | 1 8 | + | ___________ __| + | | SW1 | LED |__| + | |__________| | + | ___| + | _____________________ |S | 8 + | | | |W | + | | | |2 | + | | | |__| 1 + | | UM9065L | |o| JP4 ____|____ + | | | |o| | CN | + | | | |________| + | | | | + | |___________________| | + | | + | | + | _____________ | + | | | | + | | PROM | |ooooo| JP6 | + | |____________| |ooooo| | + |_____________ _ _| + |____________________________________________| |__| + + +UM9065L : ARCnet Controller + +SW 1 : Shared Memory Address and I/O Base + +:: + + ON=0 + + 12345|Memory Address + -----|-------------- + 00001| D4000 + 00010| CC000 + 00110| D0000 + 01110| D1000 + 01101| D9000 + 10010| CC800 + 10011| DC800 + 11110| D1800 + +It seems that the bits are considered in reverse order. Also, you must +observe that some of those addresses are unusual and I didn't probe them; I +used a memory dump in DOS to identify them. For the 00000 configuration and +some others that I didn't write here the card seems to conflict with the +video card (an S3 GENDAC). I leave the full decoding of those addresses to +you. + +:: + + 678| I/O Address + ---|------------ + 000| 260 + 001| failed probe + 010| 2E0 + 011| 380 + 100| 290 + 101| 350 + 110| failed probe + 111| 3E0 + + SW 2 : Node ID (binary coded) + + JP 4 : Boot PROM enable CLOSE - enabled + OPEN - disabled + + JP 6 : IRQ set (ONLY ONE jumper on 1-5 for IRQ 2-6) + + +Acer +==== + +8-bit card, Model 5210-003 +-------------------------- + + - from Vojtech Pavlik using portions of the existing + arcnet-hardware file. + +This is a 90C26 based card. Its configuration seems similar to the SMC +PC100, but has some additional jumpers I don't know the meaning of. + +:: + + __ + | | + ___________|__|_________________________ + | | | | + | | BNC | | + | |______| ___| + | _____________________ |___ + | | | | + | | Hybrid IC | | + | | | o|o J1 | + | |_____________________| 8|8 | + | 8|8 J5 | + | o|o | + | 8|8 | + |__ 8|8 | + (|__| LED o|o | + | 8|8 | + | 8|8 J15 | + | | + | _____ | + | | | _____ | + | | | | | ___| + | | | | | | + | _____ | ROM | | UFS | | + | | | | | | | | + | | | ___ | | | | | + | | | | | |__.__| |__.__| | + | | NCR | |XTL| _____ _____ | + | | | |___| | | | | | + | |90C26| | | | | | + | | | | RAM | | UFS | | + | | | J17 o|o | | | | | + | | | J16 o|o | | | | | + | |__.__| |__.__| |__.__| | + | ___ | + | | |8 | + | |SW2| | + | | | | + | |___|1 | + | ___ | + | | |10 J18 o|o | + | | | o|o | + | |SW1| o|o | + | | | J21 o|o | + | |___|1 | + | | + |____________________________________| + + +Legend:: + + 90C26 ARCNET Chip + XTL 20 MHz Crystal + SW1 1-6 Base I/O Address Select + 7-10 Memory Address Select + SW2 1-8 Node ID Select (ID0-ID7) + J1-J5 IRQ Select + J6-J21 Unknown (Probably extra timeouts & ROM enable ...) + LED1 Activity LED + BNC Coax connector (STAR ARCnet) + RAM 2k of SRAM + ROM Boot ROM socket + UFS Unidentified Flying Sockets + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must not be 0. +Switch 1 (ID0) serves as the least significant bit (LSB). + +Setting one of the switches to OFF means "1", ON means "0". + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + +Don't set this to 0 or 255; these values are reserved. + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The switches 1 to 6 of switch block SW1 are used to select one +of 32 possible I/O Base addresses using the following tables:: + + | Hex + Switch | Value + -------|------- + 1 | 200 + 2 | 100 + 3 | 80 + 4 | 40 + 5 | 20 + 6 | 10 + +The I/O address is sum of all switches set to "1". Remember that +the I/O address space bellow 0x200 is RESERVED for mainboard, so +switch 1 should be ALWAYS SET TO OFF. + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of sixteen positions. However, the addresses below +A0000 are likely to cause system hang because there's main RAM. + +Jumpers 7-10 of switch block SW1 select the Memory Base address:: + + Switch | Hex RAM + 7 8 9 10 | Address + ----------------|--------- + OFF OFF OFF OFF | F0000 (conflicts with main BIOS) + OFF OFF OFF ON | E0000 + OFF OFF ON OFF | D0000 + OFF OFF ON ON | C0000 (conflicts with video BIOS) + OFF ON OFF OFF | B0000 (conflicts with mono video) + OFF ON OFF ON | A0000 (conflicts with graphics) + + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means +shorted, OFF means open:: + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 7 + OFF ON OFF OFF OFF | 5 + OFF OFF ON OFF OFF | 4 + OFF OFF OFF ON OFF | 3 + OFF OFF OFF OFF ON | 2 + + +Unknown jumpers & sockets +^^^^^^^^^^^^^^^^^^^^^^^^^ + +I know nothing about these. I just guess that J16&J17 are timeout +jumpers and maybe one of J18-J21 selects ROM. Also J6-J10 and +J11-J15 are connecting IRQ2-7 to some pins on the UFSs. I can't +guess the purpose. + +Datapoint? +========== + +LAN-ARC-8, an 8-bit card +------------------------ + + - from Vojtech Pavlik + +This is another SMC 90C65-based ARCnet card. I couldn't identify the +manufacturer, but it might be DataPoint, because the card has the +original arcNet logo in its upper right corner. + +:: + + _______________________________________________________ + | _________ | + | | SW2 | ON arcNet | + | |_________| OFF ___| + | _____________ 1 ______ 8 | | 8 + | | | SW1 | XTAL | ____________ | S | + | > RAM (2k) | |______|| | | W | + | |_____________| | H | | 3 | + | _________|_____ y | |___| 1 + | _________ | | |b | | + | |_________| | | |r | | + | | SMC | |i | | + | | 90C65| |d | | + | _________ | | | | | + | | SW1 | ON | | |I | | + | |_________| OFF |_________|_____/C | _____| + | 1 8 | | | |___ + | ______________ | | | BNC |___| + | | | |____________| |_____| + | > EPROM SOCKET | _____________ | + | |______________| |_____________| | + | ______________| + | | + |________________________________________| + +Legend:: + + 90C65 ARCNET Chip + SW1 1-5: Base Memory Address Select + 6-8: Base I/O Address Select + SW2 1-8: Node ID Select + SW3 1-5: IRQ Select + 6-7: Extra Timeout + 8 : ROM Enable + BNC Coax connector + XTAL 20 MHz Crystal + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW3 are used to set the node ID. Each node attached +to the network must have an unique node ID which must not be 0. +Switch 1 serves as the least significant bit (LSB). + +Setting one of the switches to Off means "1", On means "0". + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 + OFF ON ON | 290 + ON OFF ON | 2E0 (Manufacturer's default) + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 0x2000. + +Jumpers 3-5 of switch block SW1 select the Memory Base address. + +:: + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + + *) To enable the Boot ROM set the switch 8 of switch block SW3 to position ON. + +The switches 1 and 2 probably add 0x0800 and 0x1000 to RAM base address. + + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Switches 1-5 of the switch block SW3 control the IRQ level:: + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 3 + OFF ON OFF OFF OFF | 4 + OFF OFF ON OFF OFF | 5 + OFF OFF OFF ON OFF | 7 + OFF OFF OFF OFF ON | 2 + + +Setting the Timeout Parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The switches 6-7 of the switch block SW3 are used to determine the timeout +parameters. These two switches are normally left in the OFF position. + + +Topware +======= + +8-bit card, TA-ARC/10 +--------------------- + + - from Vojtech Pavlik + +This is another very similar 90C65 card. Most of the switches and jumpers +are the same as on other clones. + +:: + + _____________________________________________________________________ + | ___________ | | ______ | + | |SW2 NODE ID| | | | XTAL | | + | |___________| | Hybrid IC | |______| | + | ___________ | | __| + | |SW1 MEM+I/O| |_________________________| LED1|__|) + | |___________| 1 2 | + | J3 |o|o| TIMEOUT ______| + | ______________ |o|o| | | + | | | ___________________ | RJ | + | > EPROM SOCKET | | \ |------| + |J2 |______________| | | | | + ||o| | | |______| + ||o| ROM ENABLE | SMC | _________ | + | _____________ | 90C65 | |_________| _____| + | | | | | | |___ + | > RAM (2k) | | | | BNC |___| + | |_____________| | | |_____| + | |____________________| | + | ________ IRQ 2 3 4 5 7 ___________ | + ||________| |o|o|o|o|o| |___________| | + |________ J1|o|o|o|o|o| ______________| + | | + |_____________________________________________| + +Legend:: + + 90C65 ARCNET Chip + XTAL 20 MHz Crystal + SW1 1-5 Base Memory Address Select + 6-8 Base I/O Address Select + SW2 1-8 Node ID Select (ID0-ID7) + J1 IRQ Select + J2 ROM Enable + J3 Extra Timeout + LED1 Activity LED + BNC Coax connector (BUS ARCnet) + RJ Twisted Pair Connector (daisy chain) + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW2 are used to set the node ID. Each node attached to +the network must have an unique node ID which must not be 0. Switch 1 (ID0) +serves as the least significant bit (LSB). + +Setting one of the switches to Off means "1", On means "0". + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Label | Value + -------|-------|------- + 1 | ID0 | 1 + 2 | ID1 | 2 + 3 | ID2 | 4 + 4 | ID3 | 8 + 5 | ID4 | 16 + 6 | ID5 | 32 + 7 | ID6 | 64 + 8 | ID7 | 128 + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 (Manufacturer's default) + OFF ON ON | 290 + ON OFF ON | 2E0 + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 0x2000. + +Jumpers 3-5 of switch block SW1 select the Memory Base address. + +:: + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 (Manufacturer's default) + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + + *) To enable the Boot ROM short the jumper J2. + +The jumpers 1 and 2 probably add 0x0800 and 0x1000 to RAM address. + + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means +shorted, OFF means open:: + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 2 + OFF ON OFF OFF OFF | 3 + OFF OFF ON OFF OFF | 4 + OFF OFF OFF ON OFF | 5 + OFF OFF OFF OFF ON | 7 + + +Setting the Timeout Parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The jumpers J3 are used to set the timeout parameters. These two +jumpers are normally left open. + +Thomas-Conrad +============= + +Model #500-6242-0097 REV A (8-bit card) +--------------------------------------- + + - from Lars Karlsson <100617.3473@compuserve.com> + +:: + + ________________________________________________________ + | ________ ________ |_____ + | |........| |........| | + | |________| |________| ___| + | SW 3 SW 1 | | + | Base I/O Base Addr. Station | | + | address | | + | ______ switch | | + | | | | | + | | | |___| + | | | ______ |___._ + | |______| |______| ____| BNC + | Jumper- _____| Connector + | Main chip block _ __| ' + | | | | RJ Connector + | |_| | with 110 Ohm + | |__ Terminator + | ___________ __| + | |...........| | RJ-jack + | |...........| _____ | (unused) + | |___________| |_____| |__ + | Boot PROM socket IRQ-jumpers |_ Diagnostic + |________ __ _| LED (red) + | | | | | | | | | | | | | | | | | | | | | | + | | | | | | | | | | | | | | | | | | | | |________| + | + | + +And here are the settings for some of the switches and jumpers on the cards. + +:: + + I/O + + 1 2 3 4 5 6 7 8 + + 2E0----- 0 0 0 1 0 0 0 1 + 2F0----- 0 0 0 1 0 0 0 0 + 300----- 0 0 0 0 1 1 1 1 + 350----- 0 0 0 0 1 1 1 0 + +"0" in the above example means switch is off "1" means that it is on. + +:: + + ShMem address. + + 1 2 3 4 5 6 7 8 + + CX00--0 0 1 1 | | | + DX00--0 0 1 0 | + X000--------- 1 1 | + X400--------- 1 0 | + X800--------- 0 1 | + XC00--------- 0 0 + ENHANCED----------- 1 + COMPATIBLE--------- 0 + +:: + + IRQ + + + 3 4 5 7 2 + . . . . . + . . . . . + + +There is a DIP-switch with 8 switches, used to set the shared memory address +to be used. The first 6 switches set the address, the 7th doesn't have any +function, and the 8th switch is used to select "compatible" or "enhanced". +When I got my two cards, one of them had this switch set to "enhanced". That +card didn't work at all, it wasn't even recognized by the driver. The other +card had this switch set to "compatible" and it behaved absolutely normally. I +guess that the switch on one of the cards, must have been changed accidentally +when the card was taken out of its former host. The question remains +unanswered, what is the purpose of the "enhanced" position? + +[Avery's note: "enhanced" probably either disables shared memory (use IO +ports instead) or disables IO ports (use memory addresses instead). This +varies by the type of card involved. I fail to see how either of these +enhance anything. Send me more detailed information about this mode, or +just use "compatible" mode instead.] + +Waterloo Microsystems Inc. ?? +============================= + +8-bit card (C) 1985 +------------------- + - from Robert Michael Best + +[Avery's note: these don't work with my driver for some reason. These cards +SEEM to have settings similar to the PDI508Plus, which is +software-configured and doesn't work with my driver either. The "Waterloo +chip" is a boot PROM, probably designed specifically for the University of +Waterloo. If you have any further information about this card, please +e-mail me.] + +The probe has not been able to detect the card on any of the J2 settings, +and I tried them again with the "Waterloo" chip removed. + +:: + + _____________________________________________________________________ + | \/ \/ ___ __ __ | + | C4 C4 |^| | M || ^ ||^| | + | -- -- |_| | 5 || || | C3 | + | \/ \/ C10 |___|| ||_| | + | C4 C4 _ _ | | ?? | + | -- -- | \/ || | | + | | || | | + | | || C1 | | + | | || | \/ _____| + | | C6 || | C9 | |___ + | | || | -- | BNC |___| + | | || | >C7| |_____| + | | || | | + | __ __ |____||_____| 1 2 3 6 | + || ^ | >C4| |o|o|o|o|o|o| J2 >C4| | + || | |o|o|o|o|o|o| | + || C2 | >C4| >C4| | + || | >C8| | + || | 2 3 4 5 6 7 IRQ >C4| | + ||_____| |o|o|o|o|o|o| J3 | + |_______ |o|o|o|o|o|o| _______________| + | | + |_____________________________________________| + + C1 -- "COM9026 + SMC 8638" + In a chip socket. + + C2 -- "@Copyright + Waterloo Microsystems Inc. + 1985" + In a chip Socket with info printed on a label covering a round window + showing the circuit inside. (The window indicates it is an EPROM chip.) + + C3 -- "COM9032 + SMC 8643" + In a chip socket. + + C4 -- "74LS" + 9 total no sockets. + + M5 -- "50006-136 + 20.000000 MHZ + MTQ-T1-S3 + 0 M-TRON 86-40" + Metallic case with 4 pins, no socket. + + C6 -- "MOSTEK@TC8643 + MK6116N-20 + MALAYSIA" + No socket. + + C7 -- No stamp or label but in a 20 pin chip socket. + + C8 -- "PAL10L8CN + 8623" + In a 20 pin socket. + + C9 -- "PAl16R4A-2CN + 8641" + In a 20 pin socket. + + C10 -- "M8640 + NMC + 9306N" + In an 8 pin socket. + + ?? -- Some components on a smaller board and attached with 20 pins all + along the side closest to the BNC connector. The are coated in a dark + resin. + +On the board there are two jumper banks labeled J2 and J3. The +manufacturer didn't put a J1 on the board. The two boards I have both +came with a jumper box for each bank. + +:: + + J2 -- Numbered 1 2 3 4 5 6. + 4 and 5 are not stamped due to solder points. + + J3 -- IRQ 2 3 4 5 6 7 + +The board itself has a maple leaf stamped just above the irq jumpers +and "-2 46-86" beside C2. Between C1 and C6 "ASS 'Y 300163" and "@1986 +CORMAN CUSTOM ELECTRONICS CORP." stamped just below the BNC connector. +Below that "MADE IN CANADA" + +No Name +======= + +8-bit cards, 16-bit cards +------------------------- + + - from Juergen Seifert + +I have named this ARCnet card "NONAME", since there is no name of any +manufacturer on the Installation manual nor on the shipping box. The only +hint to the existence of a manufacturer at all is written in copper, +it is "Made in Taiwan" + +This description has been written by Juergen Seifert +using information from the Original + + "ARCnet Installation Manual" + +:: + + ________________________________________________________________ + | |STAR| BUS| T/P| | + | |____|____|____| | + | _____________________ | + | | | | + | | | | + | | | | + | | SMC | | + | | | | + | | COM90C65 | | + | | | | + | | | | + | |__________-__________| | + | _____| + | _______________ | CN | + | | PROM | |_____| + | > SOCKET | | + | |_______________| 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 | + | _______________ _______________ | + | |o|o|o|o|o|o|o|o| | SW1 || SW2 || + | |o|o|o|o|o|o|o|o| |_______________||_______________|| + |___ 2 3 4 5 7 E E R Node ID IOB__|__MEM____| + | \ IRQ / T T O | + |__________________1_2_M______________________| + +Legend:: + + COM90C65: ARCnet Probe + S1 1-8: Node ID Select + S2 1-3: I/O Base Address Select + 4-6: Memory Base Address Select + 7-8: RAM Offset Select + ET1, ET2 Extended Timeout Select + ROM ROM Enable Select + CN RG62 Coax Connector + STAR| BUS | T/P Three fields for placing a sign (colored circle) + indicating the topology of the card + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in group SW1 are used to set the node ID. +Each node attached to the network must have an unique node ID which +must be different from 0. +Switch 8 serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 8 | 1 + 7 | 2 + 6 | 4 + 5 | 8 + 4 | 16 + 3 | 32 + 2 | 64 + 1 | 128 + +Some Examples:: + + Switch | Hex | Decimal + 1 2 3 4 5 6 7 8 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The first three switches in switch group SW2 are used to select one +of eight possible I/O Base addresses using the following table:: + + Switch | Hex I/O + 1 2 3 | Address + ------------|-------- + ON ON ON | 260 + ON ON OFF | 290 + ON OFF ON | 2E0 (Manufacturer's default) + ON OFF OFF | 2F0 + OFF ON ON | 300 + OFF ON OFF | 350 + OFF OFF ON | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 4-6 of switch group SW2 select the Base of the 16K block. +Within that 16K address space, the buffer may be assigned any one of four +positions, determined by the offset, switches 7 and 8 of group SW2. + +:: + + Switch | Hex RAM | Hex ROM + 4 5 6 7 8 | Address | Address *) + -----------|---------|----------- + 0 0 0 0 0 | C0000 | C2000 + 0 0 0 0 1 | C0800 | C2000 + 0 0 0 1 0 | C1000 | C2000 + 0 0 0 1 1 | C1800 | C2000 + | | + 0 0 1 0 0 | C4000 | C6000 + 0 0 1 0 1 | C4800 | C6000 + 0 0 1 1 0 | C5000 | C6000 + 0 0 1 1 1 | C5800 | C6000 + | | + 0 1 0 0 0 | CC000 | CE000 + 0 1 0 0 1 | CC800 | CE000 + 0 1 0 1 0 | CD000 | CE000 + 0 1 0 1 1 | CD800 | CE000 + | | + 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) + 0 1 1 0 1 | D0800 | D2000 + 0 1 1 1 0 | D1000 | D2000 + 0 1 1 1 1 | D1800 | D2000 + | | + 1 0 0 0 0 | D4000 | D6000 + 1 0 0 0 1 | D4800 | D6000 + 1 0 0 1 0 | D5000 | D6000 + 1 0 0 1 1 | D5800 | D6000 + | | + 1 0 1 0 0 | D8000 | DA000 + 1 0 1 0 1 | D8800 | DA000 + 1 0 1 1 0 | D9000 | DA000 + 1 0 1 1 1 | D9800 | DA000 + | | + 1 1 0 0 0 | DC000 | DE000 + 1 1 0 0 1 | DC800 | DE000 + 1 1 0 1 0 | DD000 | DE000 + 1 1 0 1 1 | DD800 | DE000 + | | + 1 1 1 0 0 | E0000 | E2000 + 1 1 1 0 1 | E0800 | E2000 + 1 1 1 1 0 | E1000 | E2000 + 1 1 1 1 1 | E1800 | E2000 + + *) To enable the 8K Boot PROM install the jumper ROM. + The default is jumper ROM not installed. + + +Setting Interrupt Request Lines (IRQ) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To select a hardware interrupt level set one (only one!) of the jumpers +IRQ2, IRQ3, IRQ4, IRQ5 or IRQ7. The manufacturer's default is IRQ2. + + +Setting the Timeouts +^^^^^^^^^^^^^^^^^^^^ + +The two jumpers labeled ET1 and ET2 are used to determine the timeout +parameters (response and reconfiguration time). Every node in a network +must be set to the same timeout values. + +:: + + ET1 ET2 | Response Time (us) | Reconfiguration Time (ms) + --------|--------------------|-------------------------- + Off Off | 78 | 840 (Default) + Off On | 285 | 1680 + On Off | 563 | 1680 + On On | 1130 | 1680 + +On means jumper installed, Off means jumper not installed + + +16-BIT ARCNET +------------- + +The manual of my 8-Bit NONAME ARCnet Card contains another description +of a 16-Bit Coax / Twisted Pair Card. This description is incomplete, +because there are missing two pages in the manual booklet. (The table +of contents reports pages ... 2-9, 2-11, 2-12, 3-1, ... but inside +the booklet there is a different way of counting ... 2-9, 2-10, A-1, +(empty page), 3-1, ..., 3-18, A-1 (again), A-2) +Also the picture of the board layout is not as good as the picture of +8-Bit card, because there isn't any letter like "SW1" written to the +picture. + +Should somebody have such a board, please feel free to complete this +description or to send a mail to me! + +This description has been written by Juergen Seifert +using information from the Original + + "ARCnet Installation Manual" + +:: + + ___________________________________________________________________ + < _________________ _________________ | + > | SW? || SW? | | + < |_________________||_________________| | + > ____________________ | + < | | | + > | | | + < | | | + > | | | + < | | | + > | | | + < | | | + > |____________________| | + < ____| + > ____________________ | | + < | | | J1 | + > | < | | + < |____________________| ? ? ? ? ? ? |____| + > |o|o|o|o|o|o| | + < |o|o|o|o|o|o| | + > | + < __ ___________| + > | | | + <____________| |_______________________________________| + + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in group SW2 are used to set the node ID. +Each node attached to the network must have an unique node ID which +must be different from 0. +Switch 8 serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 8 | 1 + 7 | 2 + 6 | 4 + 5 | 8 + 4 | 16 + 3 | 32 + 2 | 64 + 1 | 128 + +Some Examples:: + + Switch | Hex | Decimal + 1 2 3 4 5 6 7 8 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The first three switches in switch group SW1 are used to select one +of eight possible I/O Base addresses using the following table:: + + Switch | Hex I/O + 3 2 1 | Address + ------------|-------- + ON ON ON | 260 + ON ON OFF | 290 + ON OFF ON | 2E0 (Manufacturer's default) + ON OFF OFF | 2F0 + OFF ON ON | 300 + OFF ON OFF | 350 + OFF OFF ON | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 6-8 of switch group SW1 select the Base of the 16K block. +Within that 16K address space, the buffer may be assigned any one of four +positions, determined by the offset, switches 4 and 5 of group SW1:: + + Switch | Hex RAM | Hex ROM + 8 7 6 5 4 | Address | Address + -----------|---------|----------- + 0 0 0 0 0 | C0000 | C2000 + 0 0 0 0 1 | C0800 | C2000 + 0 0 0 1 0 | C1000 | C2000 + 0 0 0 1 1 | C1800 | C2000 + | | + 0 0 1 0 0 | C4000 | C6000 + 0 0 1 0 1 | C4800 | C6000 + 0 0 1 1 0 | C5000 | C6000 + 0 0 1 1 1 | C5800 | C6000 + | | + 0 1 0 0 0 | CC000 | CE000 + 0 1 0 0 1 | CC800 | CE000 + 0 1 0 1 0 | CD000 | CE000 + 0 1 0 1 1 | CD800 | CE000 + | | + 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) + 0 1 1 0 1 | D0800 | D2000 + 0 1 1 1 0 | D1000 | D2000 + 0 1 1 1 1 | D1800 | D2000 + | | + 1 0 0 0 0 | D4000 | D6000 + 1 0 0 0 1 | D4800 | D6000 + 1 0 0 1 0 | D5000 | D6000 + 1 0 0 1 1 | D5800 | D6000 + | | + 1 0 1 0 0 | D8000 | DA000 + 1 0 1 0 1 | D8800 | DA000 + 1 0 1 1 0 | D9000 | DA000 + 1 0 1 1 1 | D9800 | DA000 + | | + 1 1 0 0 0 | DC000 | DE000 + 1 1 0 0 1 | DC800 | DE000 + 1 1 0 1 0 | DD000 | DE000 + 1 1 0 1 1 | DD800 | DE000 + | | + 1 1 1 0 0 | E0000 | E2000 + 1 1 1 0 1 | E0800 | E2000 + 1 1 1 1 0 | E1000 | E2000 + 1 1 1 1 1 | E1800 | E2000 + + +Setting Interrupt Request Lines (IRQ) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +?????????????????????????????????????? + + +Setting the Timeouts +^^^^^^^^^^^^^^^^^^^^ + +?????????????????????????????????????? + + +8-bit cards ("Made in Taiwan R.O.C.") +------------------------------------- + + - from Vojtech Pavlik + +I have named this ARCnet card "NONAME", since I got only the card with +no manual at all and the only text identifying the manufacturer is +"MADE IN TAIWAN R.O.C" printed on the card. + +:: + + ____________________________________________________________ + | 1 2 3 4 5 6 7 8 | + | |o|o| JP1 o|o|o|o|o|o|o|o| ON | + | + o|o|o|o|o|o|o|o| ___| + | _____________ o|o|o|o|o|o|o|o| OFF _____ | | ID7 + | | | SW1 | | | | ID6 + | > RAM (2k) | ____________________ | H | | S | ID5 + | |_____________| | || y | | W | ID4 + | | || b | | 2 | ID3 + | | || r | | | ID2 + | | || i | | | ID1 + | | 90C65 || d | |___| ID0 + | SW3 | || | | + | |o|o|o|o|o|o|o|o| ON | || I | | + | |o|o|o|o|o|o|o|o| | || C | | + | |o|o|o|o|o|o|o|o| OFF |____________________|| | _____| + | 1 2 3 4 5 6 7 8 | | | |___ + | ______________ | | | BNC |___| + | | | |_____| |_____| + | > EPROM SOCKET | | + | |______________| | + | ______________| + | | + |_____________________________________________| + +Legend:: + + 90C65 ARCNET Chip + SW1 1-5: Base Memory Address Select + 6-8: Base I/O Address Select + SW2 1-8: Node ID Select (ID0-ID7) + SW3 1-5: IRQ Select + 6-7: Extra Timeout + 8 : ROM Enable + JP1 Led connector + BNC Coax connector + +Although the jumpers SW1 and SW3 are marked SW, not JP, they are jumpers, not +switches. + +Setting the jumpers to ON means connecting the upper two pins, off the bottom +two - or - in case of IRQ setting, connecting none of them at all. + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must not be 0. +Switch 1 (ID0) serves as the least significant bit (LSB). + +Setting one of the switches to Off means "1", On means "0". + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Label | Value + -------|-------|------- + 1 | ID0 | 1 + 2 | ID1 | 2 + 3 | ID2 | 4 + 4 | ID3 | 8 + 5 | ID4 | 16 + 6 | ID5 | 32 + 7 | ID6 | 64 + 8 | ID7 | 128 + +Some Examples:: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 + OFF ON ON | 290 + ON OFF ON | 2E0 (Manufacturer's default) + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 0x2000. + +Jumpers 3-5 of jumper block SW1 select the Memory Base address. + +:: + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + + *) To enable the Boot ROM set the jumper 8 of jumper block SW3 to position ON. + +The jumpers 1 and 2 probably add 0x0800, 0x1000 and 0x1800 to RAM adders. + +Setting the Interrupt Line +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Jumpers 1-5 of the jumper block SW3 control the IRQ level:: + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 2 + OFF ON OFF OFF OFF | 3 + OFF OFF ON OFF OFF | 4 + OFF OFF OFF ON OFF | 5 + OFF OFF OFF OFF ON | 7 + + +Setting the Timeout Parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The jumpers 6-7 of the jumper block SW3 are used to determine the timeout +parameters. These two jumpers are normally left in the OFF position. + + + +(Generic Model 9058) +-------------------- + - from Andrew J. Kroll + - Sorry this sat in my to-do box for so long, Andrew! (yikes - over a + year!) + +:: + + _____ + | < + | .---' + ________________________________________________________________ | | + | | SW2 | | | + | ___________ |_____________| | | + | | | 1 2 3 4 5 6 ___| | + | > 6116 RAM | _________ 8 | | | + | |___________| |20MHzXtal| 7 | | | + | |_________| __________ 6 | S | | + | 74LS373 | |- 5 | W | | + | _________ | E |- 4 | | | + | >_______| ______________|..... P |- 3 | 3 | | + | | | : O |- 2 | | | + | | | : X |- 1 |___| | + | ________________ | | : Y |- | | + | | SW1 | | SL90C65 | : |- | | + | |________________| | | : B |- | | + | 1 2 3 4 5 6 7 8 | | : O |- | | + | |_________o____|..../ A |- _______| | + | ____________________ | R |- | |------, + | | | | D |- | BNC | # | + | > 2764 PROM SOCKET | |__________|- |_______|------' + | |____________________| _________ | | + | >________| <- 74LS245 | | + | | | + |___ ______________| | + |H H H H H H H H H H H H H H H H H H H H H H H| | | + |U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U| | | + \| + +Legend:: + + SL90C65 ARCNET Controller / Transceiver /Logic + SW1 1-5: IRQ Select + 6: ET1 + 7: ET2 + 8: ROM ENABLE + SW2 1-3: Memory Buffer/PROM Address + 3-6: I/O Address Map + SW3 1-8: Node ID Select + BNC BNC RG62/U Connection + *I* have had success using RG59B/U with *NO* terminators! + What gives?! + +SW1: Timeouts, Interrupt and ROM +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To select a hardware interrupt level set one (only one!) of the dip switches +up (on) SW1...(switches 1-5) +IRQ3, IRQ4, IRQ5, IRQ7, IRQ2. The Manufacturer's default is IRQ2. + +The switches on SW1 labeled EXT1 (switch 6) and EXT2 (switch 7) +are used to determine the timeout parameters. These two dip switches +are normally left off (down). + + To enable the 8K Boot PROM position SW1 switch 8 on (UP) labeled ROM. + The default is jumper ROM not installed. + + +Setting the I/O Base Address +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The last three switches in switch group SW2 are used to select one +of eight possible I/O Base addresses using the following table:: + + + Switch | Hex I/O + 4 5 6 | Address + -------|-------- + 0 0 0 | 260 + 0 0 1 | 290 + 0 1 0 | 2E0 (Manufacturer's default) + 0 1 1 | 2F0 + 1 0 0 | 300 + 1 0 1 | 350 + 1 1 0 | 380 + 1 1 1 | 3E0 + + +Setting the Base Memory Address (RAM & ROM) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 1-3 of switch group SW2 select the Base of the 16K block. +(0 = DOWN, 1 = UP) +I could, however, only verify two settings... + + +:: + + Switch| Hex RAM | Hex ROM + 1 2 3 | Address | Address + ------|---------|----------- + 0 0 0 | E0000 | E2000 + 0 0 1 | D0000 | D2000 (Manufacturer's default) + 0 1 0 | ????? | ????? + 0 1 1 | ????? | ????? + 1 0 0 | ????? | ????? + 1 0 1 | ????? | ????? + 1 1 0 | ????? | ????? + 1 1 1 | ????? | ????? + + +Setting the Node ID +^^^^^^^^^^^^^^^^^^^ + +The eight switches in group SW3 are used to set the node ID. +Each node attached to the network must have an unique node ID which +must be different from 0. +Switch 1 serves as the least significant bit (LSB). +switches in the DOWN position are OFF (0) and in the UP position are ON (1) + +The node ID is the sum of the values of all switches set to "1" +These values are:: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + +Some Examples:: + + Switch# | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed <-. + 0 0 0 0 0 0 0 1 | 1 | 1 | + 0 0 0 0 0 0 1 0 | 2 | 2 | + 0 0 0 0 0 0 1 1 | 3 | 3 | + . . . | | | + 0 1 0 1 0 1 0 1 | 55 | 85 | + . . . | | + Don't use 0 or 255! + 1 0 1 0 1 0 1 0 | AA | 170 | + . . . | | | + 1 1 1 1 1 1 0 1 | FD | 253 | + 1 1 1 1 1 1 1 0 | FE | 254 | + 1 1 1 1 1 1 1 1 | FF | 255 <-' + + +Tiara +===== + +(model unknown) +--------------- + + - from Christoph Lameter + + +Here is information about my card as far as I could figure it out:: + + + ----------------------------------------------- tiara + Tiara LanCard of Tiara Computer Systems. + + +----------------------------------------------+ + ! ! Transmitter Unit ! ! + ! +------------------+ ------- + ! MEM Coax Connector + ! ROM 7654321 <- I/O ------- + ! : : +--------+ ! + ! : : ! 90C66LJ! +++ + ! : : ! ! !D Switch to set + ! : : ! ! !I the Nodenumber + ! : : +--------+ !P + ! !++ + ! 234567 <- IRQ ! + +------------!!!!!!!!!!!!!!!!!!!!!!!!--------+ + !!!!!!!!!!!!!!!!!!!!!!!! + +- 0 = Jumper Installed +- 1 = Open + +Top Jumper line Bit 7 = ROM Enable 654=Memory location 321=I/O + +Settings for Memory Location (Top Jumper Line) + +=== ================ +456 Address selected +=== ================ +000 C0000 +001 C4000 +010 CC000 +011 D0000 +100 D4000 +101 D8000 +110 DC000 +111 E0000 +=== ================ + +Settings for I/O Address (Top Jumper Line) + +=== ==== +123 Port +=== ==== +000 260 +001 290 +010 2E0 +011 2F0 +100 300 +101 350 +110 380 +111 3E0 +=== ==== + +Settings for IRQ Selection (Lower Jumper Line) + +====== ===== +234567 +====== ===== +011111 IRQ 2 +101111 IRQ 3 +110111 IRQ 4 +111011 IRQ 5 +111110 IRQ 7 +====== ===== + +Other Cards +=========== + +I have no information on other models of ARCnet cards at the moment. Please +send any and all info to: + + apenwarr@worldvisions.ca + +Thanks. diff --git a/Documentation/networking/arcnet-hardware.txt b/Documentation/networking/arcnet-hardware.txt deleted file mode 100644 index 731de411513c..000000000000 --- a/Documentation/networking/arcnet-hardware.txt +++ /dev/null @@ -1,3133 +0,0 @@ - ------------------------------------------------------------------------------ -1) This file is a supplement to arcnet.txt. Please read that for general - driver configuration help. ------------------------------------------------------------------------------ -2) This file is no longer Linux-specific. It should probably be moved out of - the kernel sources. Ideas? ------------------------------------------------------------------------------ - -Because so many people (myself included) seem to have obtained ARCnet cards -without manuals, this file contains a quick introduction to ARCnet hardware, -some cabling tips, and a listing of all jumper settings I can find. Please -e-mail apenwarr@worldvisions.ca with any settings for your particular card, -or any other information you have! - - -INTRODUCTION TO ARCNET ----------------------- - -ARCnet is a network type which works in a way similar to popular Ethernet -networks but which is also different in some very important ways. - -First of all, you can get ARCnet cards in at least two speeds: 2.5 Mbps -(slower than Ethernet) and 100 Mbps (faster than normal Ethernet). In fact, -there are others as well, but these are less common. The different hardware -types, as far as I'm aware, are not compatible and so you cannot wire a -100 Mbps card to a 2.5 Mbps card, and so on. From what I hear, my driver does -work with 100 Mbps cards, but I haven't been able to verify this myself, -since I only have the 2.5 Mbps variety. It is probably not going to saturate -your 100 Mbps card. Stop complaining. :) - -You also cannot connect an ARCnet card to any kind of Ethernet card and -expect it to work. - -There are two "types" of ARCnet - STAR topology and BUS topology. This -refers to how the cards are meant to be wired together. According to most -available documentation, you can only connect STAR cards to STAR cards and -BUS cards to BUS cards. That makes sense, right? Well, it's not quite -true; see below under "Cabling." - -Once you get past these little stumbling blocks, ARCnet is actually quite a -well-designed standard. It uses something called "modified token passing" -which makes it completely incompatible with so-called "Token Ring" cards, -but which makes transfers much more reliable than Ethernet does. In fact, -ARCnet will guarantee that a packet arrives safely at the destination, and -even if it can't possibly be delivered properly (ie. because of a cable -break, or because the destination computer does not exist) it will at least -tell the sender about it. - -Because of the carefully defined action of the "token", it will always make -a pass around the "ring" within a maximum length of time. This makes it -useful for realtime networks. - -In addition, all known ARCnet cards have an (almost) identical programming -interface. This means that with one ARCnet driver you can support any -card, whereas with Ethernet each manufacturer uses what is sometimes a -completely different programming interface, leading to a lot of different, -sometimes very similar, Ethernet drivers. Of course, always using the same -programming interface also means that when high-performance hardware -facilities like PCI bus mastering DMA appear, it's hard to take advantage of -them. Let's not go into that. - -One thing that makes ARCnet cards difficult to program for, however, is the -limit on their packet sizes; standard ARCnet can only send packets that are -up to 508 bytes in length. This is smaller than the Internet "bare minimum" -of 576 bytes, let alone the Ethernet MTU of 1500. To compensate, an extra -level of encapsulation is defined by RFC1201, which I call "packet -splitting," that allows "virtual packets" to grow as large as 64K each, -although they are generally kept down to the Ethernet-style 1500 bytes. - -For more information on the advantages and disadvantages (mostly the -advantages) of ARCnet networks, you might try the "ARCnet Trade Association" -WWW page: - http://www.arcnet.com - - -CABLING ARCNET NETWORKS ------------------------ - -This section was rewritten by - Vojtech Pavlik -using information from several people, including: - Avery Pennraun - Stephen A. Wood - John Paul Morrison - Joachim Koenig -and Avery touched it up a bit, at Vojtech's request. - -ARCnet (the classic 2.5 Mbps version) can be connected by two different -types of cabling: coax and twisted pair. The other ARCnet-type networks -(100 Mbps TCNS and 320 kbps - 32 Mbps ARCnet Plus) use different types of -cabling (Type1, Fiber, C1, C4, C5). - -For a coax network, you "should" use 93 Ohm RG-62 cable. But other cables -also work fine, because ARCnet is a very stable network. I personally use 75 -Ohm TV antenna cable. - -Cards for coax cabling are shipped in two different variants: for BUS and -STAR network topologies. They are mostly the same. The only difference -lies in the hybrid chip installed. BUS cards use high impedance output, -while STAR use low impedance. Low impedance card (STAR) is electrically -equal to a high impedance one with a terminator installed. - -Usually, the ARCnet networks are built up from STAR cards and hubs. There -are two types of hubs - active and passive. Passive hubs are small boxes -with four BNC connectors containing four 47 Ohm resistors: - - | | wires - R + junction --R-+-R- R 47 Ohm resistors - R - | - -The shielding is connected together. Active hubs are much more complicated; -they are powered and contain electronics to amplify the signal and send it -to other segments of the net. They usually have eight connectors. Active -hubs come in two variants - dumb and smart. The dumb variant just -amplifies, but the smart one decodes to digital and encodes back all packets -coming through. This is much better if you have several hubs in the net, -since many dumb active hubs may worsen the signal quality. - -And now to the cabling. What you can connect together: - -1. A card to a card. This is the simplest way of creating a 2-computer - network. - -2. A card to a passive hub. Remember that all unused connectors on the hub - must be properly terminated with 93 Ohm (or something else if you don't - have the right ones) terminators. - (Avery's note: oops, I didn't know that. Mine (TV cable) works - anyway, though.) - -3. A card to an active hub. Here is no need to terminate the unused - connectors except some kind of aesthetic feeling. But, there may not be - more than eleven active hubs between any two computers. That of course - doesn't limit the number of active hubs on the network. - -4. An active hub to another. - -5. An active hub to passive hub. - -Remember that you cannot connect two passive hubs together. The power loss -implied by such a connection is too high for the net to operate reliably. - -An example of a typical ARCnet network: - - R S - STAR type card - S------H--------A-------S R - Terminator - | | H - Hub - | | A - Active hub - | S----H----S - S | - | - S - -The BUS topology is very similar to the one used by Ethernet. The only -difference is in cable and terminators: they should be 93 Ohm. Ethernet -uses 50 Ohm impedance. You use T connectors to put the computers on a single -line of cable, the bus. You have to put terminators at both ends of the -cable. A typical BUS ARCnet network looks like: - - RT----T------T------T------T------TR - B B B B B B - - B - BUS type card - R - Terminator - T - T connector - -But that is not all! The two types can be connected together. According to -the official documentation the only way of connecting them is using an active -hub: - - A------T------T------TR - | B B B - S---H---S - | - S - -The official docs also state that you can use STAR cards at the ends of -BUS network in place of a BUS card and a terminator: - - S------T------T------S - B B - -But, according to my own experiments, you can simply hang a BUS type card -anywhere in middle of a cable in a STAR topology network. And more - you -can use the bus card in place of any star card if you use a terminator. Then -you can build very complicated networks fulfilling all your needs! An -example: - - S - | - RT------T-------T------H------S - B B B | - | R - S------A------T-------T-------A-------H------TR - | B B | | B - | S BT | - | | | S----A-----S - S------H---A----S | | - | | S------T----H---S | - S S B R S - -A basically different cabling scheme is used with Twisted Pair cabling. Each -of the TP cards has two RJ (phone-cord style) connectors. The cards are -then daisy-chained together using a cable connecting every two neighboring -cards. The ends are terminated with RJ 93 Ohm terminators which plug into -the empty connectors of cards on the ends of the chain. An example: - - ___________ ___________ - _R_|_ _|_|_ _|_R_ - | | | | | | - |Card | |Card | |Card | - |_____| |_____| |_____| - - -There are also hubs for the TP topology. There is nothing difficult -involved in using them; you just connect a TP chain to a hub on any end or -even at both. This way you can create almost any network configuration. -The maximum of 11 hubs between any two computers on the net applies here as -well. An example: - - RP-------P--------P--------H-----P------P-----PR - | - RP-----H--------P--------H-----P------PR - | | - PR PR - - R - RJ Terminator - P - TP Card - H - TP Hub - -Like any network, ARCnet has a limited cable length. These are the maximum -cable lengths between two active ends (an active end being an active hub or -a STAR card). - - RG-62 93 Ohm up to 650 m - RG-59/U 75 Ohm up to 457 m - RG-11/U 75 Ohm up to 533 m - IBM Type 1 150 Ohm up to 200 m - IBM Type 3 100 Ohm up to 100 m - -The maximum length of all cables connected to a passive hub is limited to 65 -meters for RG-62 cabling; less for others. You can see that using passive -hubs in a large network is a bad idea. The maximum length of a single "BUS -Trunk" is about 300 meters for RG-62. The maximum distance between the two -most distant points of the net is limited to 3000 meters. The maximum length -of a TP cable between two cards/hubs is 650 meters. - - -SETTING THE JUMPERS -------------------- - -All ARCnet cards should have a total of four or five different settings: - - - the I/O address: this is the "port" your ARCnet card is on. Probed - values in the Linux ARCnet driver are only from 0x200 through 0x3F0. (If - your card has additional ones, which is possible, please tell me.) This - should not be the same as any other device on your system. According to - a doc I got from Novell, MS Windows prefers values of 0x300 or more, - eating net connections on my system (at least) otherwise. My guess is - this may be because, if your card is at 0x2E0, probing for a serial port - at 0x2E8 will reset the card and probably mess things up royally. - - Avery's favourite: 0x300. - - - the IRQ: on 8-bit cards, it might be 2 (9), 3, 4, 5, or 7. - on 16-bit cards, it might be 2 (9), 3, 4, 5, 7, or 10-15. - - Make sure this is different from any other card on your system. Note - that IRQ2 is the same as IRQ9, as far as Linux is concerned. You can - "cat /proc/interrupts" for a somewhat complete list of which ones are in - use at any given time. Here is a list of common usages from Vojtech - Pavlik : - ("Not on bus" means there is no way for a card to generate this - interrupt) - IRQ 0 - Timer 0 (Not on bus) - IRQ 1 - Keyboard (Not on bus) - IRQ 2 - IRQ Controller 2 (Not on bus, nor does interrupt the CPU) - IRQ 3 - COM2 - IRQ 4 - COM1 - IRQ 5 - FREE (LPT2 if you have it; sometimes COM3; maybe PLIP) - IRQ 6 - Floppy disk controller - IRQ 7 - FREE (LPT1 if you don't use the polling driver; PLIP) - IRQ 8 - Realtime Clock Interrupt (Not on bus) - IRQ 9 - FREE (VGA vertical sync interrupt if enabled) - IRQ 10 - FREE - IRQ 11 - FREE - IRQ 12 - FREE - IRQ 13 - Numeric Coprocessor (Not on bus) - IRQ 14 - Fixed Disk Controller - IRQ 15 - FREE (Fixed Disk Controller 2 if you have it) - - Note: IRQ 9 is used on some video cards for the "vertical retrace" - interrupt. This interrupt would have been handy for things like - video games, as it occurs exactly once per screen refresh, but - unfortunately IBM cancelled this feature starting with the original - VGA and thus many VGA/SVGA cards do not support it. For this - reason, no modern software uses this interrupt and it can almost - always be safely disabled, if your video card supports it at all. - - If your card for some reason CANNOT disable this IRQ (usually there - is a jumper), one solution would be to clip the printed circuit - contact on the board: it's the fourth contact from the left on the - back side. I take no responsibility if you try this. - - - Avery's favourite: IRQ2 (actually IRQ9). Watch that VGA, though. - - - the memory address: Unlike most cards, ARCnets use "shared memory" for - copying buffers around. Make SURE it doesn't conflict with any other - used memory in your system! - A0000 - VGA graphics memory (ok if you don't have VGA) - B0000 - Monochrome text mode - C0000 \ One of these is your VGA BIOS - usually C0000. - E0000 / - F0000 - System BIOS - - Anything less than 0xA0000 is, well, a BAD idea since it isn't above - 640k. - - Avery's favourite: 0xD0000 - - - the station address: Every ARCnet card has its own "unique" network - address from 0 to 255. Unlike Ethernet, you can set this address - yourself with a jumper or switch (or on some cards, with special - software). Since it's only 8 bits, you can only have 254 ARCnet cards - on a network. DON'T use 0 or 255, since these are reserved (although - neat stuff will probably happen if you DO use them). By the way, if you - haven't already guessed, don't set this the same as any other ARCnet on - your network! - - Avery's favourite: 3 and 4. Not that it matters. - - - There may be ETS1 and ETS2 settings. These may or may not make a - difference on your card (many manuals call them "reserved"), but are - used to change the delays used when powering up a computer on the - network. This is only necessary when wiring VERY long range ARCnet - networks, on the order of 4km or so; in any case, the only real - requirement here is that all cards on the network with ETS1 and ETS2 - jumpers have them in the same position. Chris Hindy - sent in a chart with actual values for this: - ET1 ET2 Response Time Reconfiguration Time - --- --- ------------- -------------------- - open open 74.7us 840us - open closed 283.4us 1680us - closed open 561.8us 1680us - closed closed 1118.6us 1680us - - Make sure you set ETS1 and ETS2 to the SAME VALUE for all cards on your - network. - -Also, on many cards (not mine, though) there are red and green LED's. -Vojtech Pavlik tells me this is what they mean: - GREEN RED Status - ----- --- ------ - OFF OFF Power off - OFF Short flashes Cabling problems (broken cable or not - terminated) - OFF (short) ON Card init - ON ON Normal state - everything OK, nothing - happens - ON Long flashes Data transfer - ON OFF Never happens (maybe when wrong ID) - - -The following is all the specific information people have sent me about -their own particular ARCnet cards. It is officially a mess, and contains -huge amounts of duplicated information. I have no time to fix it. If you -want to, PLEASE DO! Just send me a 'diff -u' of all your changes. - -The model # is listed right above specifics for that card, so you should be -able to use your text viewer's "search" function to find the entry you want. -If you don't KNOW what kind of card you have, try looking through the -various diagrams to see if you can tell. - -If your model isn't listed and/or has different settings, PLEASE PLEASE -tell me. I had to figure mine out without the manual, and it WASN'T FUN! - -Even if your ARCnet model isn't listed, but has the same jumpers as another -model that is, please e-mail me to say so. - -Cards Listed in this file (in this order, mostly): - - Manufacturer Model # Bits - ------------ ------- ---- - SMC PC100 8 - SMC PC110 8 - SMC PC120 8 - SMC PC130 8 - SMC PC270E 8 - SMC PC500 16 - SMC PC500Longboard 16 - SMC PC550Longboard 16 - SMC PC600 16 - SMC PC710 8 - SMC? LCS-8830(-T) 8/16 - Puredata PDI507 8 - CNet Tech CN120-Series 8 - CNet Tech CN160-Series 16 - Lantech? UM9065L chipset 8 - Acer 5210-003 8 - Datapoint? LAN-ARC-8 8 - Topware TA-ARC/10 8 - Thomas-Conrad 500-6242-0097 REV A 8 - Waterloo? (C)1985 Waterloo Micro. 8 - No Name -- 8/16 - No Name Taiwan R.O.C? 8 - No Name Model 9058 8 - Tiara Tiara Lancard? 8 - - -** SMC = Standard Microsystems Corp. -** CNet Tech = CNet Technology, Inc. - - -Unclassified Stuff ------------------- - - Please send any other information you can find. - - - And some other stuff (more info is welcome!): - From: root@ultraworld.xs4all.nl (Timo Hilbrink) - To: apenwarr@foxnet.net (Avery Pennarun) - Date: Wed, 26 Oct 1994 02:10:32 +0000 (GMT) - Reply-To: timoh@xs4all.nl - - [...parts deleted...] - - About the jumpers: On my PC130 there is one more jumper, located near the - cable-connector and it's for changing to star or bus topology; - closed: star - open: bus - On the PC500 are some more jumper-pins, one block labeled with RX,PDN,TXI - and another with ALE,LA17,LA18,LA19 these are undocumented.. - - [...more parts deleted...] - - --- CUT --- - - -** Standard Microsystems Corp (SMC) ** -PC100, PC110, PC120, PC130 (8-bit cards) -PC500, PC600 (16-bit cards) ---------------------------------- - - mainly from Avery Pennarun . Values depicted - are from Avery's setup. - - special thanks to Timo Hilbrink for noting that PC120, - 130, 500, and 600 all have the same switches as Avery's PC100. - PC500/600 have several extra, undocumented pins though. (?) - - PC110 settings were verified by Stephen A. Wood - - Also, the JP- and S-numbers probably don't match your card exactly. Try - to find jumpers/switches with the same number of settings - it's - probably more reliable. - - - JP5 [|] : : : : -(IRQ Setting) IRQ2 IRQ3 IRQ4 IRQ5 IRQ7 - Put exactly one jumper on exactly one set of pins. - - - 1 2 3 4 5 6 7 8 9 10 - S1 /----------------------------------\ -(I/O and Memory | 1 1 * 0 0 0 0 * 1 1 0 1 | - addresses) \----------------------------------/ - |--| |--------| |--------| - (a) (b) (m) - - WARNING. It's very important when setting these which way - you're holding the card, and which way you think is '1'! - - If you suspect that your settings are not being made - correctly, try reversing the direction or inverting the - switch positions. - - a: The first digit of the I/O address. - Setting Value - ------- ----- - 00 0 - 01 1 - 10 2 - 11 3 - - b: The second digit of the I/O address. - Setting Value - ------- ----- - 0000 0 - 0001 1 - 0010 2 - ... ... - 1110 E - 1111 F - - The I/O address is in the form ab0. For example, if - a is 0x2 and b is 0xE, the address will be 0x2E0. - - DO NOT SET THIS LESS THAN 0x200!!!!! - - - m: The first digit of the memory address. - Setting Value - ------- ----- - 0000 0 - 0001 1 - 0010 2 - ... ... - 1110 E - 1111 F - - The memory address is in the form m0000. For example, if - m is D, the address will be 0xD0000. - - DO NOT SET THIS TO C0000, F0000, OR LESS THAN A0000! - - 1 2 3 4 5 6 7 8 - S2 /--------------------------\ -(Station Address) | 1 1 0 0 0 0 0 0 | - \--------------------------/ - - Setting Value - ------- ----- - 00000000 00 - 10000000 01 - 01000000 02 - ... - 01111111 FE - 11111111 FF - - Note that this is binary with the digits reversed! - - DO NOT SET THIS TO 0 OR 255 (0xFF)! - - -***************************************************************************** - -** Standard Microsystems Corp (SMC) ** -PC130E/PC270E (8-bit cards) ---------------------------- - - from Juergen Seifert - - -STANDARD MICROSYSTEMS CORPORATION (SMC) ARCNET(R)-PC130E/PC270E -=============================================================== - -This description has been written by Juergen Seifert -using information from the following Original SMC Manual - - "Configuration Guide for - ARCNET(R)-PC130E/PC270 - Network Controller Boards - Pub. # 900.044A - June, 1989" - -ARCNET is a registered trademark of the Datapoint Corporation -SMC is a registered trademark of the Standard Microsystems Corporation - -The PC130E is an enhanced version of the PC130 board, is equipped with a -standard BNC female connector for connection to RG-62/U coax cable. -Since this board is designed both for point-to-point connection in star -networks and for connection to bus networks, it is downwardly compatible -with all the other standard boards designed for coax networks (that is, -the PC120, PC110 and PC100 star topology boards and the PC220, PC210 and -PC200 bus topology boards). - -The PC270E is an enhanced version of the PC260 board, is equipped with two -modular RJ11-type jacks for connection to twisted pair wiring. -It can be used in a star or a daisy-chained network. - - - 8 7 6 5 4 3 2 1 - ________________________________________________________________ - | | S1 | | - | |_________________| | - | Offs|Base |I/O Addr | - | RAM Addr | ___| - | ___ ___ CR3 |___| - | | \/ | CR4 |___| - | | PROM | ___| - | | | N | | 8 - | | SOCKET | o | | 7 - | |________| d | | 6 - | ___________________ e | | 5 - | | | A | S | 4 - | |oo| EXT2 | | d | 2 | 3 - | |oo| EXT1 | SMC | d | | 2 - | |oo| ROM | 90C63 | r |___| 1 - | |oo| IRQ7 | | |o| _____| - | |oo| IRQ5 | | |o| | J1 | - | |oo| IRQ4 | | STAR |_____| - | |oo| IRQ3 | | | J2 | - | |oo| IRQ2 |___________________| |_____| - |___ ______________| - | | - |_____________________________________________| - -Legend: - -SMC 90C63 ARCNET Controller / Transceiver /Logic -S1 1-3: I/O Base Address Select - 4-6: Memory Base Address Select - 7-8: RAM Offset Select -S2 1-8: Node ID Select -EXT Extended Timeout Select -ROM ROM Enable Select -STAR Selected - Star Topology (PC130E only) - Deselected - Bus Topology (PC130E only) -CR3/CR4 Diagnostic LEDs -J1 BNC RG62/U Connector (PC130E only) -J1 6-position Telephone Jack (PC270E only) -J2 6-position Telephone Jack (PC270E only) - -Setting one of the switches to Off/Open means "1", On/Closed means "0". - - -Setting the Node ID -------------------- - -The eight switches in group S2 are used to set the node ID. -These switches work in a way similar to the PC100-series cards; see that -entry for more information. - - -Setting the I/O Base Address ----------------------------- - -The first three switches in switch group S1 are used to select one -of eight possible I/O Base addresses using the following table - - - Switch | Hex I/O - 1 2 3 | Address - -------|-------- - 0 0 0 | 260 - 0 0 1 | 290 - 0 1 0 | 2E0 (Manufacturer's default) - 0 1 1 | 2F0 - 1 0 0 | 300 - 1 0 1 | 350 - 1 1 0 | 380 - 1 1 1 | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer requires 2K of a 16K block of RAM. The base of this -16K block can be located in any of eight positions. -Switches 4-6 of switch group S1 select the Base of the 16K block. -Within that 16K address space, the buffer may be assigned any one of four -positions, determined by the offset, switches 7 and 8 of group S1. - - Switch | Hex RAM | Hex ROM - 4 5 6 7 8 | Address | Address *) - -----------|---------|----------- - 0 0 0 0 0 | C0000 | C2000 - 0 0 0 0 1 | C0800 | C2000 - 0 0 0 1 0 | C1000 | C2000 - 0 0 0 1 1 | C1800 | C2000 - | | - 0 0 1 0 0 | C4000 | C6000 - 0 0 1 0 1 | C4800 | C6000 - 0 0 1 1 0 | C5000 | C6000 - 0 0 1 1 1 | C5800 | C6000 - | | - 0 1 0 0 0 | CC000 | CE000 - 0 1 0 0 1 | CC800 | CE000 - 0 1 0 1 0 | CD000 | CE000 - 0 1 0 1 1 | CD800 | CE000 - | | - 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) - 0 1 1 0 1 | D0800 | D2000 - 0 1 1 1 0 | D1000 | D2000 - 0 1 1 1 1 | D1800 | D2000 - | | - 1 0 0 0 0 | D4000 | D6000 - 1 0 0 0 1 | D4800 | D6000 - 1 0 0 1 0 | D5000 | D6000 - 1 0 0 1 1 | D5800 | D6000 - | | - 1 0 1 0 0 | D8000 | DA000 - 1 0 1 0 1 | D8800 | DA000 - 1 0 1 1 0 | D9000 | DA000 - 1 0 1 1 1 | D9800 | DA000 - | | - 1 1 0 0 0 | DC000 | DE000 - 1 1 0 0 1 | DC800 | DE000 - 1 1 0 1 0 | DD000 | DE000 - 1 1 0 1 1 | DD800 | DE000 - | | - 1 1 1 0 0 | E0000 | E2000 - 1 1 1 0 1 | E0800 | E2000 - 1 1 1 1 0 | E1000 | E2000 - 1 1 1 1 1 | E1800 | E2000 - -*) To enable the 8K Boot PROM install the jumper ROM. - The default is jumper ROM not installed. - - -Setting the Timeouts and Interrupt ----------------------------------- - -The jumpers labeled EXT1 and EXT2 are used to determine the timeout -parameters. These two jumpers are normally left open. - -To select a hardware interrupt level set one (only one!) of the jumpers -IRQ2, IRQ3, IRQ4, IRQ5, IRQ7. The Manufacturer's default is IRQ2. - - -Configuring the PC130E for Star or Bus Topology ------------------------------------------------ - -The single jumper labeled STAR is used to configure the PC130E board for -star or bus topology. -When the jumper is installed, the board may be used in a star network, when -it is removed, the board can be used in a bus topology. - - -Diagnostic LEDs ---------------- - -Two diagnostic LEDs are visible on the rear bracket of the board. -The green LED monitors the network activity: the red one shows the -board activity: - - Green | Status Red | Status - -------|------------------- ---------|------------------- - on | normal activity flash/on | data transfer - blink | reconfiguration off | no data transfer; - off | defective board or | incorrect memory or - | node ID is zero | I/O address - - -***************************************************************************** - -** Standard Microsystems Corp (SMC) ** -PC500/PC550 Longboard (16-bit cards) -------------------------------------- - - from Juergen Seifert - - -STANDARD MICROSYSTEMS CORPORATION (SMC) ARCNET-PC500/PC550 Long Board -===================================================================== - -Note: There is another Version of the PC500 called Short Version, which - is different in hard- and software! The most important differences - are: - - The long board has no Shared memory. - - On the long board the selection of the interrupt is done by binary - coded switch, on the short board directly by jumper. - -[Avery's note: pay special attention to that: the long board HAS NO SHARED -MEMORY. This means the current Linux-ARCnet driver can't use these cards. -I have obtained a PC500Longboard and will be doing some experiments on it in -the future, but don't hold your breath. Thanks again to Juergen Seifert for -his advice about this!] - -This description has been written by Juergen Seifert -using information from the following Original SMC Manual - - "Configuration Guide for - SMC ARCNET-PC500/PC550 - Series Network Controller Boards - Pub. # 900.033 Rev. A - November, 1989" - -ARCNET is a registered trademark of the Datapoint Corporation -SMC is a registered trademark of the Standard Microsystems Corporation - -The PC500 is equipped with a standard BNC female connector for connection -to RG-62/U coax cable. -The board is designed both for point-to-point connection in star networks -and for connection to bus networks. - -The PC550 is equipped with two modular RJ11-type jacks for connection -to twisted pair wiring. -It can be used in a star or a daisy-chained (BUS) network. - - 1 - 0 9 8 7 6 5 4 3 2 1 6 5 4 3 2 1 - ____________________________________________________________________ - < | SW1 | | SW2 | | - > |_____________________| |_____________| | - < IRQ |I/O Addr | - > ___| - < CR4 |___| - > CR3 |___| - < ___| - > N | | 8 - < o | | 7 - > d | S | 6 - < e | W | 5 - > A | 3 | 4 - < d | | 3 - > d | | 2 - < r |___| 1 - > |o| _____| - < |o| | J1 | - > 3 1 JP6 |_____| - < |o|o| JP2 | J2 | - > |o|o| |_____| - < 4 2__ ______________| - > | | | - <____| |_____________________________________________| - -Legend: - -SW1 1-6: I/O Base Address Select - 7-10: Interrupt Select -SW2 1-6: Reserved for Future Use -SW3 1-8: Node ID Select -JP2 1-4: Extended Timeout Select -JP6 Selected - Star Topology (PC500 only) - Deselected - Bus Topology (PC500 only) -CR3 Green Monitors Network Activity -CR4 Red Monitors Board Activity -J1 BNC RG62/U Connector (PC500 only) -J1 6-position Telephone Jack (PC550 only) -J2 6-position Telephone Jack (PC550 only) - -Setting one of the switches to Off/Open means "1", On/Closed means "0". - - -Setting the Node ID -------------------- - -The eight switches in group SW3 are used to set the node ID. Each node -attached to the network must have an unique node ID which must be -different from 0. -Switch 1 serves as the least significant bit (LSB). - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Value - -------|------- - 1 | 1 - 2 | 2 - 3 | 4 - 4 | 8 - 5 | 16 - 6 | 32 - 7 | 64 - 8 | 128 - -Some Examples: - - Switch | Hex | Decimal - 8 7 6 5 4 3 2 1 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . | | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The first six switches in switch group SW1 are used to select one -of 32 possible I/O Base addresses using the following table - - Switch | Hex I/O - 6 5 4 3 2 1 | Address - -------------|-------- - 0 1 0 0 0 0 | 200 - 0 1 0 0 0 1 | 210 - 0 1 0 0 1 0 | 220 - 0 1 0 0 1 1 | 230 - 0 1 0 1 0 0 | 240 - 0 1 0 1 0 1 | 250 - 0 1 0 1 1 0 | 260 - 0 1 0 1 1 1 | 270 - 0 1 1 0 0 0 | 280 - 0 1 1 0 0 1 | 290 - 0 1 1 0 1 0 | 2A0 - 0 1 1 0 1 1 | 2B0 - 0 1 1 1 0 0 | 2C0 - 0 1 1 1 0 1 | 2D0 - 0 1 1 1 1 0 | 2E0 (Manufacturer's default) - 0 1 1 1 1 1 | 2F0 - 1 1 0 0 0 0 | 300 - 1 1 0 0 0 1 | 310 - 1 1 0 0 1 0 | 320 - 1 1 0 0 1 1 | 330 - 1 1 0 1 0 0 | 340 - 1 1 0 1 0 1 | 350 - 1 1 0 1 1 0 | 360 - 1 1 0 1 1 1 | 370 - 1 1 1 0 0 0 | 380 - 1 1 1 0 0 1 | 390 - 1 1 1 0 1 0 | 3A0 - 1 1 1 0 1 1 | 3B0 - 1 1 1 1 0 0 | 3C0 - 1 1 1 1 0 1 | 3D0 - 1 1 1 1 1 0 | 3E0 - 1 1 1 1 1 1 | 3F0 - - -Setting the Interrupt ---------------------- - -Switches seven through ten of switch group SW1 are used to select the -interrupt level. The interrupt level is binary coded, so selections -from 0 to 15 would be possible, but only the following eight values will -be supported: 3, 4, 5, 7, 9, 10, 11, 12. - - Switch | IRQ - 10 9 8 7 | - ---------|-------- - 0 0 1 1 | 3 - 0 1 0 0 | 4 - 0 1 0 1 | 5 - 0 1 1 1 | 7 - 1 0 0 1 | 9 (=2) (default) - 1 0 1 0 | 10 - 1 0 1 1 | 11 - 1 1 0 0 | 12 - - -Setting the Timeouts --------------------- - -The two jumpers JP2 (1-4) are used to determine the timeout parameters. -These two jumpers are normally left open. -Refer to the COM9026 Data Sheet for alternate configurations. - - -Configuring the PC500 for Star or Bus Topology ----------------------------------------------- - -The single jumper labeled JP6 is used to configure the PC500 board for -star or bus topology. -When the jumper is installed, the board may be used in a star network, when -it is removed, the board can be used in a bus topology. - - -Diagnostic LEDs ---------------- - -Two diagnostic LEDs are visible on the rear bracket of the board. -The green LED monitors the network activity: the red one shows the -board activity: - - Green | Status Red | Status - -------|------------------- ---------|------------------- - on | normal activity flash/on | data transfer - blink | reconfiguration off | no data transfer; - off | defective board or | incorrect memory or - | node ID is zero | I/O address - - -***************************************************************************** - -** SMC ** -PC710 (8-bit card) ------------------- - - from J.S. van Oosten - -Note: this data is gathered by experimenting and looking at info of other -cards. However, I'm sure I got 99% of the settings right. - -The SMC710 card resembles the PC270 card, but is much more basic (i.e. no -LEDs, RJ11 jacks, etc.) and 8 bit. Here's a little drawing: - - _______________________________________ - | +---------+ +---------+ |____ - | | S2 | | S1 | | - | +---------+ +---------+ | - | | - | +===+ __ | - | | R | | | X-tal ###___ - | | O | |__| ####__'| - | | M | || ### - | +===+ | - | | - | .. JP1 +----------+ | - | .. | big chip | | - | .. | 90C63 | | - | .. | | | - | .. +----------+ | - ------- ----------- - ||||||||||||||||||||| - -The row of jumpers at JP1 actually consists of 8 jumpers, (sometimes -labelled) the same as on the PC270, from top to bottom: EXT2, EXT1, ROM, -IRQ7, IRQ5, IRQ4, IRQ3, IRQ2 (gee, wonder what they would do? :-) ) - -S1 and S2 perform the same function as on the PC270, only their numbers -are swapped (S1 is the nodeaddress, S2 sets IO- and RAM-address). - -I know it works when connected to a PC110 type ARCnet board. - - -***************************************************************************** - -** Possibly SMC ** -LCS-8830(-T) (8 and 16-bit cards) ---------------------------------- - - from Mathias Katzer - - Marek Michalkiewicz says the - LCS-8830 is slightly different from LCS-8830-T. These are 8 bit, BUS - only (the JP0 jumper is hardwired), and BNC only. - -This is a LCS-8830-T made by SMC, I think ('SMC' only appears on one PLCC, -nowhere else, not even on the few Xeroxed sheets from the manual). - -SMC ARCnet Board Type LCS-8830-T - - ------------------------------------ - | | - | JP3 88 8 JP2 | - | ##### | \ | - | ##### ET1 ET2 ###| - | 8 ###| - | U3 SW 1 JP0 ###| Phone Jacks - | -- ###| - | | | | - | | | SW2 | - | | | | - | | | ##### | - | -- ##### #### BNC Connector - | #### - | 888888 JP1 | - | 234567 | - -- ------- - ||||||||||||||||||||||||||| - -------------------------- - - -SW1: DIP-Switches for Station Address -SW2: DIP-Switches for Memory Base and I/O Base addresses - -JP0: If closed, internal termination on (default open) -JP1: IRQ Jumpers -JP2: Boot-ROM enabled if closed -JP3: Jumpers for response timeout - -U3: Boot-ROM Socket - - -ET1 ET2 Response Time Idle Time Reconfiguration Time - - 78 86 840 - X 285 316 1680 - X 563 624 1680 - X X 1130 1237 1680 - -(X means closed jumper) - -(DIP-Switch downwards means "0") - -The station address is binary-coded with SW1. - -The I/O base address is coded with DIP-Switches 6,7 and 8 of SW2: - -Switches Base -678 Address -000 260-26f -100 290-29f -010 2e0-2ef -110 2f0-2ff -001 300-30f -101 350-35f -011 380-38f -111 3e0-3ef - - -DIP Switches 1-5 of SW2 encode the RAM and ROM Address Range: - -Switches RAM ROM -12345 Address Range Address Range -00000 C:0000-C:07ff C:2000-C:3fff -10000 C:0800-C:0fff -01000 C:1000-C:17ff -11000 C:1800-C:1fff -00100 C:4000-C:47ff C:6000-C:7fff -10100 C:4800-C:4fff -01100 C:5000-C:57ff -11100 C:5800-C:5fff -00010 C:C000-C:C7ff C:E000-C:ffff -10010 C:C800-C:Cfff -01010 C:D000-C:D7ff -11010 C:D800-C:Dfff -00110 D:0000-D:07ff D:2000-D:3fff -10110 D:0800-D:0fff -01110 D:1000-D:17ff -11110 D:1800-D:1fff -00001 D:4000-D:47ff D:6000-D:7fff -10001 D:4800-D:4fff -01001 D:5000-D:57ff -11001 D:5800-D:5fff -00101 D:8000-D:87ff D:A000-D:bfff -10101 D:8800-D:8fff -01101 D:9000-D:97ff -11101 D:9800-D:9fff -00011 D:C000-D:c7ff D:E000-D:ffff -10011 D:C800-D:cfff -01011 D:D000-D:d7ff -11011 D:D800-D:dfff -00111 E:0000-E:07ff E:2000-E:3fff -10111 E:0800-E:0fff -01111 E:1000-E:17ff -11111 E:1800-E:1fff - - -***************************************************************************** - -** PureData Corp ** -PDI507 (8-bit card) --------------------- - - from Mark Rejhon (slight modifications by Avery) - - Avery's note: I think PDI508 cards (but definitely NOT PDI508Plus cards) - are mostly the same as this. PDI508Plus cards appear to be mainly - software-configured. - -Jumpers: - There is a jumper array at the bottom of the card, near the edge - connector. This array is labelled J1. They control the IRQs and - something else. Put only one jumper on the IRQ pins. - - ETS1, ETS2 are for timing on very long distance networks. See the - more general information near the top of this file. - - There is a J2 jumper on two pins. A jumper should be put on them, - since it was already there when I got the card. I don't know what - this jumper is for though. - - There is a two-jumper array for J3. I don't know what it is for, - but there were already two jumpers on it when I got the card. It's - a six pin grid in a two-by-three fashion. The jumpers were - configured as follows: - - .-------. - o | o o | - :-------: ------> Accessible end of card with connectors - o | o o | in this direction -------> - `-------' - -Carl de Billy explains J3 and J4: - - J3 Diagram: - - .-------. - o | o o | - :-------: TWIST Technology - o | o o | - `-------' - .-------. - | o o | o - :-------: COAX Technology - | o o | o - `-------' - - - If using coax cable in a bus topology the J4 jumper must be removed; - place it on one pin. - - - If using bus topology with twisted pair wiring move the J3 - jumpers so they connect the middle pin and the pins closest to the RJ11 - Connectors. Also the J4 jumper must be removed; place it on one pin of - J4 jumper for storage. - - - If using star topology with twisted pair wiring move the J3 - jumpers so they connect the middle pin and the pins closest to the RJ11 - connectors. - - -DIP Switches: - - The DIP switches accessible on the accessible end of the card while - it is installed, is used to set the ARCnet address. There are 8 - switches. Use an address from 1 to 254. - - Switch No. - 12345678 ARCnet address - ----------------------------------------- - 00000000 FF (Don't use this!) - 00000001 FE - 00000010 FD - .... - 11111101 2 - 11111110 1 - 11111111 0 (Don't use this!) - - There is another array of eight DIP switches at the top of the - card. There are five labelled MS0-MS4 which seem to control the - memory address, and another three labelled IO0-IO2 which seem to - control the base I/O address of the card. - - This was difficult to test by trial and error, and the I/O addresses - are in a weird order. This was tested by setting the DIP switches, - rebooting the computer, and attempting to load ARCETHER at various - addresses (mostly between 0x200 and 0x400). The address that caused - the red transmit LED to blink, is the one that I thought works. - - Also, the address 0x3D0 seem to have a special meaning, since the - ARCETHER packet driver loaded fine, but without the red LED - blinking. I don't know what 0x3D0 is for though. I recommend using - an address of 0x300 since Windows may not like addresses below - 0x300. - - IO Switch No. - 210 I/O address - ------------------------------- - 111 0x260 - 110 0x290 - 101 0x2E0 - 100 0x2F0 - 011 0x300 - 010 0x350 - 001 0x380 - 000 0x3E0 - - The memory switches set a reserved address space of 0x1000 bytes - (0x100 segment units, or 4k). For example if I set an address of - 0xD000, it will use up addresses 0xD000 to 0xD100. - - The memory switches were tested by booting using QEMM386 stealth, - and using LOADHI to see what address automatically became excluded - from the upper memory regions, and then attempting to load ARCETHER - using these addresses. - - I recommend using an ARCnet memory address of 0xD000, and putting - the EMS page frame at 0xC000 while using QEMM stealth mode. That - way, you get contiguous high memory from 0xD100 almost all the way - the end of the megabyte. - - Memory Switch 0 (MS0) didn't seem to work properly when set to OFF - on my card. It could be malfunctioning on my card. Experiment with - it ON first, and if it doesn't work, set it to OFF. (It may be a - modifier for the 0x200 bit?) - - MS Switch No. - 43210 Memory address - -------------------------------- - 00001 0xE100 (guessed - was not detected by QEMM) - 00011 0xE000 (guessed - was not detected by QEMM) - 00101 0xDD00 - 00111 0xDC00 - 01001 0xD900 - 01011 0xD800 - 01101 0xD500 - 01111 0xD400 - 10001 0xD100 - 10011 0xD000 - 10101 0xCD00 - 10111 0xCC00 - 11001 0xC900 (guessed - crashes tested system) - 11011 0xC800 (guessed - crashes tested system) - 11101 0xC500 (guessed - crashes tested system) - 11111 0xC400 (guessed - crashes tested system) - - -***************************************************************************** - -** CNet Technology Inc. ** -120 Series (8-bit cards) ------------------------- - - from Juergen Seifert - - -CNET TECHNOLOGY INC. (CNet) ARCNET 120A SERIES -============================================== - -This description has been written by Juergen Seifert -using information from the following Original CNet Manual - - "ARCNET - USER'S MANUAL - for - CN120A - CN120AB - CN120TP - CN120ST - CN120SBT - P/N:12-01-0007 - Revision 3.00" - -ARCNET is a registered trademark of the Datapoint Corporation - -P/N 120A ARCNET 8 bit XT/AT Star -P/N 120AB ARCNET 8 bit XT/AT Bus -P/N 120TP ARCNET 8 bit XT/AT Twisted Pair -P/N 120ST ARCNET 8 bit XT/AT Star, Twisted Pair -P/N 120SBT ARCNET 8 bit XT/AT Star, Bus, Twisted Pair - - __________________________________________________________________ - | | - | ___| - | LED |___| - | ___| - | N | | ID7 - | o | | ID6 - | d | S | ID5 - | e | W | ID4 - | ___________________ A | 2 | ID3 - | | | d | | ID2 - | | | 1 2 3 4 5 6 7 8 d | | ID1 - | | | _________________ r |___| ID0 - | | 90C65 || SW1 | ____| - | JP 8 7 | ||_________________| | | - | |o|o| JP1 | | | J2 | - | |o|o| |oo| | | JP 1 1 1 | | - | ______________ | | 0 1 2 |____| - | | PROM | |___________________| |o|o|o| _____| - | > SOCKET | JP 6 5 4 3 2 |o|o|o| | J1 | - | |______________| |o|o|o|o|o| |o|o|o| |_____| - |_____ |o|o|o|o|o| ______________| - | | - |_____________________________________________| - -Legend: - -90C65 ARCNET Probe -S1 1-5: Base Memory Address Select - 6-8: Base I/O Address Select -S2 1-8: Node ID Select (ID0-ID7) -JP1 ROM Enable Select -JP2 IRQ2 -JP3 IRQ3 -JP4 IRQ4 -JP5 IRQ5 -JP6 IRQ7 -JP7/JP8 ET1, ET2 Timeout Parameters -JP10/JP11 Coax / Twisted Pair Select (CN120ST/SBT only) -JP12 Terminator Select (CN120AB/ST/SBT only) -J1 BNC RG62/U Connector (all except CN120TP) -J2 Two 6-position Telephone Jack (CN120TP/ST/SBT only) - -Setting one of the switches to Off means "1", On means "0". - - -Setting the Node ID -------------------- - -The eight switches in SW2 are used to set the node ID. Each node attached -to the network must have an unique node ID which must be different from 0. -Switch 1 (ID0) serves as the least significant bit (LSB). - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Label | Value - -------|-------|------- - 1 | ID0 | 1 - 2 | ID1 | 2 - 3 | ID2 | 4 - 4 | ID3 | 8 - 5 | ID4 | 16 - 6 | ID5 | 32 - 7 | ID6 | 64 - 8 | ID7 | 128 - -Some Examples: - - Switch | Hex | Decimal - 8 7 6 5 4 3 2 1 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . | | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The last three switches in switch block SW1 are used to select one -of eight possible I/O Base addresses using the following table - - - Switch | Hex I/O - 6 7 8 | Address - ------------|-------- - ON ON ON | 260 - OFF ON ON | 290 - ON OFF ON | 2E0 (Manufacturer's default) - OFF OFF ON | 2F0 - ON ON OFF | 300 - OFF ON OFF | 350 - ON OFF OFF | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer (RAM) requires 2K. The base of this buffer can be -located in any of eight positions. The address of the Boot Prom is -memory base + 8K or memory base + 0x2000. -Switches 1-5 of switch block SW1 select the Memory Base address. - - Switch | Hex RAM | Hex ROM - 1 2 3 4 5 | Address | Address *) - --------------------|---------|----------- - ON ON ON ON ON | C0000 | C2000 - ON ON OFF ON ON | C4000 | C6000 - ON ON ON OFF ON | CC000 | CE000 - ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) - ON ON ON ON OFF | D4000 | D6000 - ON ON OFF ON OFF | D8000 | DA000 - ON ON ON OFF OFF | DC000 | DE000 - ON ON OFF OFF OFF | E0000 | E2000 - -*) To enable the Boot ROM install the jumper JP1 - -Note: Since the switches 1 and 2 are always set to ON it may be possible - that they can be used to add an offset of 2K, 4K or 6K to the base - address, but this feature is not documented in the manual and I - haven't tested it yet. - - -Setting the Interrupt Line --------------------------- - -To select a hardware interrupt level install one (only one!) of the jumpers -JP2, JP3, JP4, JP5, JP6. JP2 is the default. - - Jumper | IRQ - -------|----- - 2 | 2 - 3 | 3 - 4 | 4 - 5 | 5 - 6 | 7 - - -Setting the Internal Terminator on CN120AB/TP/SBT --------------------------------------------------- - -The jumper JP12 is used to enable the internal terminator. - - ----- - 0 | 0 | - ----- ON | | ON - | 0 | | 0 | - | | OFF ----- OFF - | 0 | 0 - ----- - Terminator Terminator - disabled enabled - - -Selecting the Connector Type on CN120ST/SBT -------------------------------------------- - - JP10 JP11 JP10 JP11 - ----- ----- - 0 0 | 0 | | 0 | - ----- ----- | | | | - | 0 | | 0 | | 0 | | 0 | - | | | | ----- ----- - | 0 | | 0 | 0 0 - ----- ----- - Coaxial Cable Twisted Pair Cable - (Default) - - -Setting the Timeout Parameters ------------------------------- - -The jumpers labeled EXT1 and EXT2 are used to determine the timeout -parameters. These two jumpers are normally left open. - - - -***************************************************************************** - -** CNet Technology Inc. ** -160 Series (16-bit cards) -------------------------- - - from Juergen Seifert - -CNET TECHNOLOGY INC. (CNet) ARCNET 160A SERIES -============================================== - -This description has been written by Juergen Seifert -using information from the following Original CNet Manual - - "ARCNET - USER'S MANUAL - for - CN160A - CN160AB - CN160TP - P/N:12-01-0006 - Revision 3.00" - -ARCNET is a registered trademark of the Datapoint Corporation - -P/N 160A ARCNET 16 bit XT/AT Star -P/N 160AB ARCNET 16 bit XT/AT Bus -P/N 160TP ARCNET 16 bit XT/AT Twisted Pair - - ___________________________________________________________________ - < _________________________ ___| - > |oo| JP2 | | LED |___| - < |oo| JP1 | 9026 | LED |___| - > |_________________________| ___| - < N | | ID7 - > 1 o | | ID6 - < 1 2 3 4 5 6 7 8 9 0 d | S | ID5 - > _______________ _____________________ e | W | ID4 - < | PROM | | SW1 | A | 2 | ID3 - > > SOCKET | |_____________________| d | | ID2 - < |_______________| | IO-Base | MEM | d | | ID1 - > r |___| ID0 - < ____| - > | | - < | J1 | - > | | - < |____| - > 1 1 1 1 | - < 3 4 5 6 7 JP 8 9 0 1 2 3 | - > |o|o|o|o|o| |o|o|o|o|o|o| | - < |o|o|o|o|o| __ |o|o|o|o|o|o| ___________| - > | | | - <____________| |_______________________________________| - -Legend: - -9026 ARCNET Probe -SW1 1-6: Base I/O Address Select - 7-10: Base Memory Address Select -SW2 1-8: Node ID Select (ID0-ID7) -JP1/JP2 ET1, ET2 Timeout Parameters -JP3-JP13 Interrupt Select -J1 BNC RG62/U Connector (CN160A/AB only) -J1 Two 6-position Telephone Jack (CN160TP only) -LED - -Setting one of the switches to Off means "1", On means "0". - - -Setting the Node ID -------------------- - -The eight switches in SW2 are used to set the node ID. Each node attached -to the network must have an unique node ID which must be different from 0. -Switch 1 (ID0) serves as the least significant bit (LSB). - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Label | Value - -------|-------|------- - 1 | ID0 | 1 - 2 | ID1 | 2 - 3 | ID2 | 4 - 4 | ID3 | 8 - 5 | ID4 | 16 - 6 | ID5 | 32 - 7 | ID6 | 64 - 8 | ID7 | 128 - -Some Examples: - - Switch | Hex | Decimal - 8 7 6 5 4 3 2 1 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . | | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The first six switches in switch block SW1 are used to select the I/O Base -address using the following table: - - Switch | Hex I/O - 1 2 3 4 5 6 | Address - ------------------------|-------- - OFF ON ON OFF OFF ON | 260 - OFF ON OFF ON ON OFF | 290 - OFF ON OFF OFF OFF ON | 2E0 (Manufacturer's default) - OFF ON OFF OFF OFF OFF | 2F0 - OFF OFF ON ON ON ON | 300 - OFF OFF ON OFF ON OFF | 350 - OFF OFF OFF ON ON ON | 380 - OFF OFF OFF OFF OFF ON | 3E0 - -Note: Other IO-Base addresses seem to be selectable, but only the above - combinations are documented. - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The switches 7-10 of switch block SW1 are used to select the Memory -Base address of the RAM (2K) and the PROM. - - Switch | Hex RAM | Hex ROM - 7 8 9 10 | Address | Address - ----------------|---------|----------- - OFF OFF ON ON | C0000 | C8000 - OFF OFF ON OFF | D0000 | D8000 (Default) - OFF OFF OFF ON | E0000 | E8000 - -Note: Other MEM-Base addresses seem to be selectable, but only the above - combinations are documented. - - -Setting the Interrupt Line --------------------------- - -To select a hardware interrupt level install one (only one!) of the jumpers -JP3 through JP13 using the following table: - - Jumper | IRQ - -------|----------------- - 3 | 14 - 4 | 15 - 5 | 12 - 6 | 11 - 7 | 10 - 8 | 3 - 9 | 4 - 10 | 5 - 11 | 6 - 12 | 7 - 13 | 2 (=9) Default! - -Note: - Do not use JP11=IRQ6, it may conflict with your Floppy Disk - Controller - - Use JP3=IRQ14 only, if you don't have an IDE-, MFM-, or RLL- - Hard Disk, it may conflict with their controllers - - -Setting the Timeout Parameters ------------------------------- - -The jumpers labeled JP1 and JP2 are used to determine the timeout -parameters. These two jumpers are normally left open. - - -***************************************************************************** - -** Lantech ** -8-bit card, unknown model -------------------------- - - from Vlad Lungu - his e-mail address seemed broken at - the time I tried to reach him. Sorry Vlad, if you didn't get my reply. - - ________________________________________________________________ - | 1 8 | - | ___________ __| - | | SW1 | LED |__| - | |__________| | - | ___| - | _____________________ |S | 8 - | | | |W | - | | | |2 | - | | | |__| 1 - | | UM9065L | |o| JP4 ____|____ - | | | |o| | CN | - | | | |________| - | | | | - | |___________________| | - | | - | | - | _____________ | - | | | | - | | PROM | |ooooo| JP6 | - | |____________| |ooooo| | - |_____________ _ _| - |____________________________________________| |__| - - -UM9065L : ARCnet Controller - -SW 1 : Shared Memory Address and I/O Base - - ON=0 - - 12345|Memory Address - -----|-------------- - 00001| D4000 - 00010| CC000 - 00110| D0000 - 01110| D1000 - 01101| D9000 - 10010| CC800 - 10011| DC800 - 11110| D1800 - -It seems that the bits are considered in reverse order. Also, you must -observe that some of those addresses are unusual and I didn't probe them; I -used a memory dump in DOS to identify them. For the 00000 configuration and -some others that I didn't write here the card seems to conflict with the -video card (an S3 GENDAC). I leave the full decoding of those addresses to -you. - - 678| I/O Address - ---|------------ - 000| 260 - 001| failed probe - 010| 2E0 - 011| 380 - 100| 290 - 101| 350 - 110| failed probe - 111| 3E0 - -SW 2 : Node ID (binary coded) - -JP 4 : Boot PROM enable CLOSE - enabled - OPEN - disabled - -JP 6 : IRQ set (ONLY ONE jumper on 1-5 for IRQ 2-6) - - -***************************************************************************** - -** Acer ** -8-bit card, Model 5210-003 --------------------------- - - from Vojtech Pavlik using portions of the existing - arcnet-hardware file. - -This is a 90C26 based card. Its configuration seems similar to the SMC -PC100, but has some additional jumpers I don't know the meaning of. - - __ - | | - ___________|__|_________________________ - | | | | - | | BNC | | - | |______| ___| - | _____________________ |___ - | | | | - | | Hybrid IC | | - | | | o|o J1 | - | |_____________________| 8|8 | - | 8|8 J5 | - | o|o | - | 8|8 | - |__ 8|8 | - (|__| LED o|o | - | 8|8 | - | 8|8 J15 | - | | - | _____ | - | | | _____ | - | | | | | ___| - | | | | | | - | _____ | ROM | | UFS | | - | | | | | | | | - | | | ___ | | | | | - | | | | | |__.__| |__.__| | - | | NCR | |XTL| _____ _____ | - | | | |___| | | | | | - | |90C26| | | | | | - | | | | RAM | | UFS | | - | | | J17 o|o | | | | | - | | | J16 o|o | | | | | - | |__.__| |__.__| |__.__| | - | ___ | - | | |8 | - | |SW2| | - | | | | - | |___|1 | - | ___ | - | | |10 J18 o|o | - | | | o|o | - | |SW1| o|o | - | | | J21 o|o | - | |___|1 | - | | - |____________________________________| - - -Legend: - -90C26 ARCNET Chip -XTL 20 MHz Crystal -SW1 1-6 Base I/O Address Select - 7-10 Memory Address Select -SW2 1-8 Node ID Select (ID0-ID7) -J1-J5 IRQ Select -J6-J21 Unknown (Probably extra timeouts & ROM enable ...) -LED1 Activity LED -BNC Coax connector (STAR ARCnet) -RAM 2k of SRAM -ROM Boot ROM socket -UFS Unidentified Flying Sockets - - -Setting the Node ID -------------------- - -The eight switches in SW2 are used to set the node ID. Each node attached -to the network must have an unique node ID which must not be 0. -Switch 1 (ID0) serves as the least significant bit (LSB). - -Setting one of the switches to OFF means "1", ON means "0". - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Value - -------|------- - 1 | 1 - 2 | 2 - 3 | 4 - 4 | 8 - 5 | 16 - 6 | 32 - 7 | 64 - 8 | 128 - -Don't set this to 0 or 255; these values are reserved. - - -Setting the I/O Base Address ----------------------------- - -The switches 1 to 6 of switch block SW1 are used to select one -of 32 possible I/O Base addresses using the following tables - - | Hex - Switch | Value - -------|------- - 1 | 200 - 2 | 100 - 3 | 80 - 4 | 40 - 5 | 20 - 6 | 10 - -The I/O address is sum of all switches set to "1". Remember that -the I/O address space bellow 0x200 is RESERVED for mainboard, so -switch 1 should be ALWAYS SET TO OFF. - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer (RAM) requires 2K. The base of this buffer can be -located in any of sixteen positions. However, the addresses below -A0000 are likely to cause system hang because there's main RAM. - -Jumpers 7-10 of switch block SW1 select the Memory Base address. - - Switch | Hex RAM - 7 8 9 10 | Address - ----------------|--------- - OFF OFF OFF OFF | F0000 (conflicts with main BIOS) - OFF OFF OFF ON | E0000 - OFF OFF ON OFF | D0000 - OFF OFF ON ON | C0000 (conflicts with video BIOS) - OFF ON OFF OFF | B0000 (conflicts with mono video) - OFF ON OFF ON | A0000 (conflicts with graphics) - - -Setting the Interrupt Line --------------------------- - -Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means -shorted, OFF means open. - - Jumper | IRQ - 1 2 3 4 5 | - ---------------------------- - ON OFF OFF OFF OFF | 7 - OFF ON OFF OFF OFF | 5 - OFF OFF ON OFF OFF | 4 - OFF OFF OFF ON OFF | 3 - OFF OFF OFF OFF ON | 2 - - -Unknown jumpers & sockets -------------------------- - -I know nothing about these. I just guess that J16&J17 are timeout -jumpers and maybe one of J18-J21 selects ROM. Also J6-J10 and -J11-J15 are connecting IRQ2-7 to some pins on the UFSs. I can't -guess the purpose. - - -***************************************************************************** - -** Datapoint? ** -LAN-ARC-8, an 8-bit card ------------------------- - - from Vojtech Pavlik - -This is another SMC 90C65-based ARCnet card. I couldn't identify the -manufacturer, but it might be DataPoint, because the card has the -original arcNet logo in its upper right corner. - - _______________________________________________________ - | _________ | - | | SW2 | ON arcNet | - | |_________| OFF ___| - | _____________ 1 ______ 8 | | 8 - | | | SW1 | XTAL | ____________ | S | - | > RAM (2k) | |______|| | | W | - | |_____________| | H | | 3 | - | _________|_____ y | |___| 1 - | _________ | | |b | | - | |_________| | | |r | | - | | SMC | |i | | - | | 90C65| |d | | - | _________ | | | | | - | | SW1 | ON | | |I | | - | |_________| OFF |_________|_____/C | _____| - | 1 8 | | | |___ - | ______________ | | | BNC |___| - | | | |____________| |_____| - | > EPROM SOCKET | _____________ | - | |______________| |_____________| | - | ______________| - | | - |________________________________________| - -Legend: - -90C65 ARCNET Chip -SW1 1-5: Base Memory Address Select - 6-8: Base I/O Address Select -SW2 1-8: Node ID Select -SW3 1-5: IRQ Select - 6-7: Extra Timeout - 8 : ROM Enable -BNC Coax connector -XTAL 20 MHz Crystal - - -Setting the Node ID -------------------- - -The eight switches in SW3 are used to set the node ID. Each node attached -to the network must have an unique node ID which must not be 0. -Switch 1 serves as the least significant bit (LSB). - -Setting one of the switches to Off means "1", On means "0". - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Value - -------|------- - 1 | 1 - 2 | 2 - 3 | 4 - 4 | 8 - 5 | 16 - 6 | 32 - 7 | 64 - 8 | 128 - - -Setting the I/O Base Address ----------------------------- - -The last three switches in switch block SW1 are used to select one -of eight possible I/O Base addresses using the following table - - - Switch | Hex I/O - 6 7 8 | Address - ------------|-------- - ON ON ON | 260 - OFF ON ON | 290 - ON OFF ON | 2E0 (Manufacturer's default) - OFF OFF ON | 2F0 - ON ON OFF | 300 - OFF ON OFF | 350 - ON OFF OFF | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer (RAM) requires 2K. The base of this buffer can be -located in any of eight positions. The address of the Boot Prom is -memory base + 0x2000. -Jumpers 3-5 of switch block SW1 select the Memory Base address. - - Switch | Hex RAM | Hex ROM - 1 2 3 4 5 | Address | Address *) - --------------------|---------|----------- - ON ON ON ON ON | C0000 | C2000 - ON ON OFF ON ON | C4000 | C6000 - ON ON ON OFF ON | CC000 | CE000 - ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) - ON ON ON ON OFF | D4000 | D6000 - ON ON OFF ON OFF | D8000 | DA000 - ON ON ON OFF OFF | DC000 | DE000 - ON ON OFF OFF OFF | E0000 | E2000 - -*) To enable the Boot ROM set the switch 8 of switch block SW3 to position ON. - -The switches 1 and 2 probably add 0x0800 and 0x1000 to RAM base address. - - -Setting the Interrupt Line --------------------------- - -Switches 1-5 of the switch block SW3 control the IRQ level. - - Jumper | IRQ - 1 2 3 4 5 | - ---------------------------- - ON OFF OFF OFF OFF | 3 - OFF ON OFF OFF OFF | 4 - OFF OFF ON OFF OFF | 5 - OFF OFF OFF ON OFF | 7 - OFF OFF OFF OFF ON | 2 - - -Setting the Timeout Parameters ------------------------------- - -The switches 6-7 of the switch block SW3 are used to determine the timeout -parameters. These two switches are normally left in the OFF position. - - -***************************************************************************** - -** Topware ** -8-bit card, TA-ARC/10 -------------------------- - - from Vojtech Pavlik - -This is another very similar 90C65 card. Most of the switches and jumpers -are the same as on other clones. - - _____________________________________________________________________ -| ___________ | | ______ | -| |SW2 NODE ID| | | | XTAL | | -| |___________| | Hybrid IC | |______| | -| ___________ | | __| -| |SW1 MEM+I/O| |_________________________| LED1|__|) -| |___________| 1 2 | -| J3 |o|o| TIMEOUT ______| -| ______________ |o|o| | | -| | | ___________________ | RJ | -| > EPROM SOCKET | | \ |------| -|J2 |______________| | | | | -||o| | | |______| -||o| ROM ENABLE | SMC | _________ | -| _____________ | 90C65 | |_________| _____| -| | | | | | |___ -| > RAM (2k) | | | | BNC |___| -| |_____________| | | |_____| -| |____________________| | -| ________ IRQ 2 3 4 5 7 ___________ | -||________| |o|o|o|o|o| |___________| | -|________ J1|o|o|o|o|o| ______________| - | | - |_____________________________________________| - -Legend: - -90C65 ARCNET Chip -XTAL 20 MHz Crystal -SW1 1-5 Base Memory Address Select - 6-8 Base I/O Address Select -SW2 1-8 Node ID Select (ID0-ID7) -J1 IRQ Select -J2 ROM Enable -J3 Extra Timeout -LED1 Activity LED -BNC Coax connector (BUS ARCnet) -RJ Twisted Pair Connector (daisy chain) - - -Setting the Node ID -------------------- - -The eight switches in SW2 are used to set the node ID. Each node attached to -the network must have an unique node ID which must not be 0. Switch 1 (ID0) -serves as the least significant bit (LSB). - -Setting one of the switches to Off means "1", On means "0". - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Label | Value - -------|-------|------- - 1 | ID0 | 1 - 2 | ID1 | 2 - 3 | ID2 | 4 - 4 | ID3 | 8 - 5 | ID4 | 16 - 6 | ID5 | 32 - 7 | ID6 | 64 - 8 | ID7 | 128 - -Setting the I/O Base Address ----------------------------- - -The last three switches in switch block SW1 are used to select one -of eight possible I/O Base addresses using the following table: - - - Switch | Hex I/O - 6 7 8 | Address - ------------|-------- - ON ON ON | 260 (Manufacturer's default) - OFF ON ON | 290 - ON OFF ON | 2E0 - OFF OFF ON | 2F0 - ON ON OFF | 300 - OFF ON OFF | 350 - ON OFF OFF | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer (RAM) requires 2K. The base of this buffer can be -located in any of eight positions. The address of the Boot Prom is -memory base + 0x2000. -Jumpers 3-5 of switch block SW1 select the Memory Base address. - - Switch | Hex RAM | Hex ROM - 1 2 3 4 5 | Address | Address *) - --------------------|---------|----------- - ON ON ON ON ON | C0000 | C2000 - ON ON OFF ON ON | C4000 | C6000 (Manufacturer's default) - ON ON ON OFF ON | CC000 | CE000 - ON ON OFF OFF ON | D0000 | D2000 - ON ON ON ON OFF | D4000 | D6000 - ON ON OFF ON OFF | D8000 | DA000 - ON ON ON OFF OFF | DC000 | DE000 - ON ON OFF OFF OFF | E0000 | E2000 - -*) To enable the Boot ROM short the jumper J2. - -The jumpers 1 and 2 probably add 0x0800 and 0x1000 to RAM address. - - -Setting the Interrupt Line --------------------------- - -Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means -shorted, OFF means open. - - Jumper | IRQ - 1 2 3 4 5 | - ---------------------------- - ON OFF OFF OFF OFF | 2 - OFF ON OFF OFF OFF | 3 - OFF OFF ON OFF OFF | 4 - OFF OFF OFF ON OFF | 5 - OFF OFF OFF OFF ON | 7 - - -Setting the Timeout Parameters ------------------------------- - -The jumpers J3 are used to set the timeout parameters. These two -jumpers are normally left open. - - -***************************************************************************** - -** Thomas-Conrad ** -Model #500-6242-0097 REV A (8-bit card) ---------------------------------------- - - from Lars Karlsson <100617.3473@compuserve.com> - - ________________________________________________________ - | ________ ________ |_____ - | |........| |........| | - | |________| |________| ___| - | SW 3 SW 1 | | - | Base I/O Base Addr. Station | | - | address | | - | ______ switch | | - | | | | | - | | | |___| - | | | ______ |___._ - | |______| |______| ____| BNC - | Jumper- _____| Connector - | Main chip block _ __| ' - | | | | RJ Connector - | |_| | with 110 Ohm - | |__ Terminator - | ___________ __| - | |...........| | RJ-jack - | |...........| _____ | (unused) - | |___________| |_____| |__ - | Boot PROM socket IRQ-jumpers |_ Diagnostic - |________ __ _| LED (red) - | | | | | | | | | | | | | | | | | | | | | | - | | | | | | | | | | | | | | | | | | | | |________| - | - | - -And here are the settings for some of the switches and jumpers on the cards. - - - I/O - - 1 2 3 4 5 6 7 8 - -2E0----- 0 0 0 1 0 0 0 1 -2F0----- 0 0 0 1 0 0 0 0 -300----- 0 0 0 0 1 1 1 1 -350----- 0 0 0 0 1 1 1 0 - -"0" in the above example means switch is off "1" means that it is on. - - - ShMem address. - - 1 2 3 4 5 6 7 8 - -CX00--0 0 1 1 | | | -DX00--0 0 1 0 | -X000--------- 1 1 | -X400--------- 1 0 | -X800--------- 0 1 | -XC00--------- 0 0 -ENHANCED----------- 1 -COMPATIBLE--------- 0 - - - IRQ - - - 3 4 5 7 2 - . . . . . - . . . . . - - -There is a DIP-switch with 8 switches, used to set the shared memory address -to be used. The first 6 switches set the address, the 7th doesn't have any -function, and the 8th switch is used to select "compatible" or "enhanced". -When I got my two cards, one of them had this switch set to "enhanced". That -card didn't work at all, it wasn't even recognized by the driver. The other -card had this switch set to "compatible" and it behaved absolutely normally. I -guess that the switch on one of the cards, must have been changed accidentally -when the card was taken out of its former host. The question remains -unanswered, what is the purpose of the "enhanced" position? - -[Avery's note: "enhanced" probably either disables shared memory (use IO -ports instead) or disables IO ports (use memory addresses instead). This -varies by the type of card involved. I fail to see how either of these -enhance anything. Send me more detailed information about this mode, or -just use "compatible" mode instead.] - - -***************************************************************************** - -** Waterloo Microsystems Inc. ?? ** -8-bit card (C) 1985 -------------------- - - from Robert Michael Best - -[Avery's note: these don't work with my driver for some reason. These cards -SEEM to have settings similar to the PDI508Plus, which is -software-configured and doesn't work with my driver either. The "Waterloo -chip" is a boot PROM, probably designed specifically for the University of -Waterloo. If you have any further information about this card, please -e-mail me.] - -The probe has not been able to detect the card on any of the J2 settings, -and I tried them again with the "Waterloo" chip removed. - - _____________________________________________________________________ -| \/ \/ ___ __ __ | -| C4 C4 |^| | M || ^ ||^| | -| -- -- |_| | 5 || || | C3 | -| \/ \/ C10 |___|| ||_| | -| C4 C4 _ _ | | ?? | -| -- -- | \/ || | | -| | || | | -| | || C1 | | -| | || | \/ _____| -| | C6 || | C9 | |___ -| | || | -- | BNC |___| -| | || | >C7| |_____| -| | || | | -| __ __ |____||_____| 1 2 3 6 | -|| ^ | >C4| |o|o|o|o|o|o| J2 >C4| | -|| | |o|o|o|o|o|o| | -|| C2 | >C4| >C4| | -|| | >C8| | -|| | 2 3 4 5 6 7 IRQ >C4| | -||_____| |o|o|o|o|o|o| J3 | -|_______ |o|o|o|o|o|o| _______________| - | | - |_____________________________________________| - -C1 -- "COM9026 - SMC 8638" - In a chip socket. - -C2 -- "@Copyright - Waterloo Microsystems Inc. - 1985" - In a chip Socket with info printed on a label covering a round window - showing the circuit inside. (The window indicates it is an EPROM chip.) - -C3 -- "COM9032 - SMC 8643" - In a chip socket. - -C4 -- "74LS" - 9 total no sockets. - -M5 -- "50006-136 - 20.000000 MHZ - MTQ-T1-S3 - 0 M-TRON 86-40" - Metallic case with 4 pins, no socket. - -C6 -- "MOSTEK@TC8643 - MK6116N-20 - MALAYSIA" - No socket. - -C7 -- No stamp or label but in a 20 pin chip socket. - -C8 -- "PAL10L8CN - 8623" - In a 20 pin socket. - -C9 -- "PAl16R4A-2CN - 8641" - In a 20 pin socket. - -C10 -- "M8640 - NMC - 9306N" - In an 8 pin socket. - -?? -- Some components on a smaller board and attached with 20 pins all - along the side closest to the BNC connector. The are coated in a dark - resin. - -On the board there are two jumper banks labeled J2 and J3. The -manufacturer didn't put a J1 on the board. The two boards I have both -came with a jumper box for each bank. - -J2 -- Numbered 1 2 3 4 5 6. - 4 and 5 are not stamped due to solder points. - -J3 -- IRQ 2 3 4 5 6 7 - -The board itself has a maple leaf stamped just above the irq jumpers -and "-2 46-86" beside C2. Between C1 and C6 "ASS 'Y 300163" and "@1986 -CORMAN CUSTOM ELECTRONICS CORP." stamped just below the BNC connector. -Below that "MADE IN CANADA" - - -***************************************************************************** - -** No Name ** -8-bit cards, 16-bit cards -------------------------- - - from Juergen Seifert - -NONAME 8-BIT ARCNET -=================== - -I have named this ARCnet card "NONAME", since there is no name of any -manufacturer on the Installation manual nor on the shipping box. The only -hint to the existence of a manufacturer at all is written in copper, -it is "Made in Taiwan" - -This description has been written by Juergen Seifert -using information from the Original - "ARCnet Installation Manual" - - - ________________________________________________________________ - | |STAR| BUS| T/P| | - | |____|____|____| | - | _____________________ | - | | | | - | | | | - | | | | - | | SMC | | - | | | | - | | COM90C65 | | - | | | | - | | | | - | |__________-__________| | - | _____| - | _______________ | CN | - | | PROM | |_____| - | > SOCKET | | - | |_______________| 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 | - | _______________ _______________ | - | |o|o|o|o|o|o|o|o| | SW1 || SW2 || - | |o|o|o|o|o|o|o|o| |_______________||_______________|| - |___ 2 3 4 5 7 E E R Node ID IOB__|__MEM____| - | \ IRQ / T T O | - |__________________1_2_M______________________| - -Legend: - -COM90C65: ARCnet Probe -S1 1-8: Node ID Select -S2 1-3: I/O Base Address Select - 4-6: Memory Base Address Select - 7-8: RAM Offset Select -ET1, ET2 Extended Timeout Select -ROM ROM Enable Select -CN RG62 Coax Connector -STAR| BUS | T/P Three fields for placing a sign (colored circle) - indicating the topology of the card - -Setting one of the switches to Off means "1", On means "0". - - -Setting the Node ID -------------------- - -The eight switches in group SW1 are used to set the node ID. -Each node attached to the network must have an unique node ID which -must be different from 0. -Switch 8 serves as the least significant bit (LSB). - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Value - -------|------- - 8 | 1 - 7 | 2 - 6 | 4 - 5 | 8 - 4 | 16 - 3 | 32 - 2 | 64 - 1 | 128 - -Some Examples: - - Switch | Hex | Decimal - 1 2 3 4 5 6 7 8 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . | | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The first three switches in switch group SW2 are used to select one -of eight possible I/O Base addresses using the following table - - Switch | Hex I/O - 1 2 3 | Address - ------------|-------- - ON ON ON | 260 - ON ON OFF | 290 - ON OFF ON | 2E0 (Manufacturer's default) - ON OFF OFF | 2F0 - OFF ON ON | 300 - OFF ON OFF | 350 - OFF OFF ON | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer requires 2K of a 16K block of RAM. The base of this -16K block can be located in any of eight positions. -Switches 4-6 of switch group SW2 select the Base of the 16K block. -Within that 16K address space, the buffer may be assigned any one of four -positions, determined by the offset, switches 7 and 8 of group SW2. - - Switch | Hex RAM | Hex ROM - 4 5 6 7 8 | Address | Address *) - -----------|---------|----------- - 0 0 0 0 0 | C0000 | C2000 - 0 0 0 0 1 | C0800 | C2000 - 0 0 0 1 0 | C1000 | C2000 - 0 0 0 1 1 | C1800 | C2000 - | | - 0 0 1 0 0 | C4000 | C6000 - 0 0 1 0 1 | C4800 | C6000 - 0 0 1 1 0 | C5000 | C6000 - 0 0 1 1 1 | C5800 | C6000 - | | - 0 1 0 0 0 | CC000 | CE000 - 0 1 0 0 1 | CC800 | CE000 - 0 1 0 1 0 | CD000 | CE000 - 0 1 0 1 1 | CD800 | CE000 - | | - 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) - 0 1 1 0 1 | D0800 | D2000 - 0 1 1 1 0 | D1000 | D2000 - 0 1 1 1 1 | D1800 | D2000 - | | - 1 0 0 0 0 | D4000 | D6000 - 1 0 0 0 1 | D4800 | D6000 - 1 0 0 1 0 | D5000 | D6000 - 1 0 0 1 1 | D5800 | D6000 - | | - 1 0 1 0 0 | D8000 | DA000 - 1 0 1 0 1 | D8800 | DA000 - 1 0 1 1 0 | D9000 | DA000 - 1 0 1 1 1 | D9800 | DA000 - | | - 1 1 0 0 0 | DC000 | DE000 - 1 1 0 0 1 | DC800 | DE000 - 1 1 0 1 0 | DD000 | DE000 - 1 1 0 1 1 | DD800 | DE000 - | | - 1 1 1 0 0 | E0000 | E2000 - 1 1 1 0 1 | E0800 | E2000 - 1 1 1 1 0 | E1000 | E2000 - 1 1 1 1 1 | E1800 | E2000 - -*) To enable the 8K Boot PROM install the jumper ROM. - The default is jumper ROM not installed. - - -Setting Interrupt Request Lines (IRQ) -------------------------------------- - -To select a hardware interrupt level set one (only one!) of the jumpers -IRQ2, IRQ3, IRQ4, IRQ5 or IRQ7. The manufacturer's default is IRQ2. - - -Setting the Timeouts --------------------- - -The two jumpers labeled ET1 and ET2 are used to determine the timeout -parameters (response and reconfiguration time). Every node in a network -must be set to the same timeout values. - - ET1 ET2 | Response Time (us) | Reconfiguration Time (ms) - --------|--------------------|-------------------------- - Off Off | 78 | 840 (Default) - Off On | 285 | 1680 - On Off | 563 | 1680 - On On | 1130 | 1680 - -On means jumper installed, Off means jumper not installed - - -NONAME 16-BIT ARCNET -==================== - -The manual of my 8-Bit NONAME ARCnet Card contains another description -of a 16-Bit Coax / Twisted Pair Card. This description is incomplete, -because there are missing two pages in the manual booklet. (The table -of contents reports pages ... 2-9, 2-11, 2-12, 3-1, ... but inside -the booklet there is a different way of counting ... 2-9, 2-10, A-1, -(empty page), 3-1, ..., 3-18, A-1 (again), A-2) -Also the picture of the board layout is not as good as the picture of -8-Bit card, because there isn't any letter like "SW1" written to the -picture. -Should somebody have such a board, please feel free to complete this -description or to send a mail to me! - -This description has been written by Juergen Seifert -using information from the Original - "ARCnet Installation Manual" - - - ___________________________________________________________________ - < _________________ _________________ | - > | SW? || SW? | | - < |_________________||_________________| | - > ____________________ | - < | | | - > | | | - < | | | - > | | | - < | | | - > | | | - < | | | - > |____________________| | - < ____| - > ____________________ | | - < | | | J1 | - > | < | | - < |____________________| ? ? ? ? ? ? |____| - > |o|o|o|o|o|o| | - < |o|o|o|o|o|o| | - > | - < __ ___________| - > | | | - <____________| |_______________________________________| - - -Setting one of the switches to Off means "1", On means "0". - - -Setting the Node ID -------------------- - -The eight switches in group SW2 are used to set the node ID. -Each node attached to the network must have an unique node ID which -must be different from 0. -Switch 8 serves as the least significant bit (LSB). - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Value - -------|------- - 8 | 1 - 7 | 2 - 6 | 4 - 5 | 8 - 4 | 16 - 3 | 32 - 2 | 64 - 1 | 128 - -Some Examples: - - Switch | Hex | Decimal - 1 2 3 4 5 6 7 8 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . | | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The first three switches in switch group SW1 are used to select one -of eight possible I/O Base addresses using the following table - - Switch | Hex I/O - 3 2 1 | Address - ------------|-------- - ON ON ON | 260 - ON ON OFF | 290 - ON OFF ON | 2E0 (Manufacturer's default) - ON OFF OFF | 2F0 - OFF ON ON | 300 - OFF ON OFF | 350 - OFF OFF ON | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer requires 2K of a 16K block of RAM. The base of this -16K block can be located in any of eight positions. -Switches 6-8 of switch group SW1 select the Base of the 16K block. -Within that 16K address space, the buffer may be assigned any one of four -positions, determined by the offset, switches 4 and 5 of group SW1. - - Switch | Hex RAM | Hex ROM - 8 7 6 5 4 | Address | Address - -----------|---------|----------- - 0 0 0 0 0 | C0000 | C2000 - 0 0 0 0 1 | C0800 | C2000 - 0 0 0 1 0 | C1000 | C2000 - 0 0 0 1 1 | C1800 | C2000 - | | - 0 0 1 0 0 | C4000 | C6000 - 0 0 1 0 1 | C4800 | C6000 - 0 0 1 1 0 | C5000 | C6000 - 0 0 1 1 1 | C5800 | C6000 - | | - 0 1 0 0 0 | CC000 | CE000 - 0 1 0 0 1 | CC800 | CE000 - 0 1 0 1 0 | CD000 | CE000 - 0 1 0 1 1 | CD800 | CE000 - | | - 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) - 0 1 1 0 1 | D0800 | D2000 - 0 1 1 1 0 | D1000 | D2000 - 0 1 1 1 1 | D1800 | D2000 - | | - 1 0 0 0 0 | D4000 | D6000 - 1 0 0 0 1 | D4800 | D6000 - 1 0 0 1 0 | D5000 | D6000 - 1 0 0 1 1 | D5800 | D6000 - | | - 1 0 1 0 0 | D8000 | DA000 - 1 0 1 0 1 | D8800 | DA000 - 1 0 1 1 0 | D9000 | DA000 - 1 0 1 1 1 | D9800 | DA000 - | | - 1 1 0 0 0 | DC000 | DE000 - 1 1 0 0 1 | DC800 | DE000 - 1 1 0 1 0 | DD000 | DE000 - 1 1 0 1 1 | DD800 | DE000 - | | - 1 1 1 0 0 | E0000 | E2000 - 1 1 1 0 1 | E0800 | E2000 - 1 1 1 1 0 | E1000 | E2000 - 1 1 1 1 1 | E1800 | E2000 - - -Setting Interrupt Request Lines (IRQ) -------------------------------------- - -?????????????????????????????????????? - - -Setting the Timeouts --------------------- - -?????????????????????????????????????? - - -***************************************************************************** - -** No Name ** -8-bit cards ("Made in Taiwan R.O.C.") ------------ - - from Vojtech Pavlik - -I have named this ARCnet card "NONAME", since I got only the card with -no manual at all and the only text identifying the manufacturer is -"MADE IN TAIWAN R.O.C" printed on the card. - - ____________________________________________________________ - | 1 2 3 4 5 6 7 8 | - | |o|o| JP1 o|o|o|o|o|o|o|o| ON | - | + o|o|o|o|o|o|o|o| ___| - | _____________ o|o|o|o|o|o|o|o| OFF _____ | | ID7 - | | | SW1 | | | | ID6 - | > RAM (2k) | ____________________ | H | | S | ID5 - | |_____________| | || y | | W | ID4 - | | || b | | 2 | ID3 - | | || r | | | ID2 - | | || i | | | ID1 - | | 90C65 || d | |___| ID0 - | SW3 | || | | - | |o|o|o|o|o|o|o|o| ON | || I | | - | |o|o|o|o|o|o|o|o| | || C | | - | |o|o|o|o|o|o|o|o| OFF |____________________|| | _____| - | 1 2 3 4 5 6 7 8 | | | |___ - | ______________ | | | BNC |___| - | | | |_____| |_____| - | > EPROM SOCKET | | - | |______________| | - | ______________| - | | - |_____________________________________________| - -Legend: - -90C65 ARCNET Chip -SW1 1-5: Base Memory Address Select - 6-8: Base I/O Address Select -SW2 1-8: Node ID Select (ID0-ID7) -SW3 1-5: IRQ Select - 6-7: Extra Timeout - 8 : ROM Enable -JP1 Led connector -BNC Coax connector - -Although the jumpers SW1 and SW3 are marked SW, not JP, they are jumpers, not -switches. - -Setting the jumpers to ON means connecting the upper two pins, off the bottom -two - or - in case of IRQ setting, connecting none of them at all. - -Setting the Node ID -------------------- - -The eight switches in SW2 are used to set the node ID. Each node attached -to the network must have an unique node ID which must not be 0. -Switch 1 (ID0) serves as the least significant bit (LSB). - -Setting one of the switches to Off means "1", On means "0". - -The node ID is the sum of the values of all switches set to "1" -These values are: - - Switch | Label | Value - -------|-------|------- - 1 | ID0 | 1 - 2 | ID1 | 2 - 3 | ID2 | 4 - 4 | ID3 | 8 - 5 | ID4 | 16 - 6 | ID5 | 32 - 7 | ID6 | 64 - 8 | ID7 | 128 - -Some Examples: - - Switch | Hex | Decimal - 8 7 6 5 4 3 2 1 | Node ID | Node ID - ----------------|---------|--------- - 0 0 0 0 0 0 0 0 | not allowed - 0 0 0 0 0 0 0 1 | 1 | 1 - 0 0 0 0 0 0 1 0 | 2 | 2 - 0 0 0 0 0 0 1 1 | 3 | 3 - . . . | | - 0 1 0 1 0 1 0 1 | 55 | 85 - . . . | | - 1 0 1 0 1 0 1 0 | AA | 170 - . . . | | - 1 1 1 1 1 1 0 1 | FD | 253 - 1 1 1 1 1 1 1 0 | FE | 254 - 1 1 1 1 1 1 1 1 | FF | 255 - - -Setting the I/O Base Address ----------------------------- - -The last three switches in switch block SW1 are used to select one -of eight possible I/O Base addresses using the following table - - - Switch | Hex I/O - 6 7 8 | Address - ------------|-------- - ON ON ON | 260 - OFF ON ON | 290 - ON OFF ON | 2E0 (Manufacturer's default) - OFF OFF ON | 2F0 - ON ON OFF | 300 - OFF ON OFF | 350 - ON OFF OFF | 380 - OFF OFF OFF | 3E0 - - -Setting the Base Memory (RAM) buffer Address --------------------------------------------- - -The memory buffer (RAM) requires 2K. The base of this buffer can be -located in any of eight positions. The address of the Boot Prom is -memory base + 0x2000. -Jumpers 3-5 of jumper block SW1 select the Memory Base address. - - Switch | Hex RAM | Hex ROM - 1 2 3 4 5 | Address | Address *) - --------------------|---------|----------- - ON ON ON ON ON | C0000 | C2000 - ON ON OFF ON ON | C4000 | C6000 - ON ON ON OFF ON | CC000 | CE000 - ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) - ON ON ON ON OFF | D4000 | D6000 - ON ON OFF ON OFF | D8000 | DA000 - ON ON ON OFF OFF | DC000 | DE000 - ON ON OFF OFF OFF | E0000 | E2000 - -*) To enable the Boot ROM set the jumper 8 of jumper block SW3 to position ON. - -The jumpers 1 and 2 probably add 0x0800, 0x1000 and 0x1800 to RAM adders. - -Setting the Interrupt Line --------------------------- - -Jumpers 1-5 of the jumper block SW3 control the IRQ level. - - Jumper | IRQ - 1 2 3 4 5 | - ---------------------------- - ON OFF OFF OFF OFF | 2 - OFF ON OFF OFF OFF | 3 - OFF OFF ON OFF OFF | 4 - OFF OFF OFF ON OFF | 5 - OFF OFF OFF OFF ON | 7 - - -Setting the Timeout Parameters ------------------------------- - -The jumpers 6-7 of the jumper block SW3 are used to determine the timeout -parameters. These two jumpers are normally left in the OFF position. - - -***************************************************************************** - -** No Name ** -(Generic Model 9058) --------------------- - - from Andrew J. Kroll - - Sorry this sat in my to-do box for so long, Andrew! (yikes - over a - year!) - _____ - | < - | .---' - ________________________________________________________________ | | - | | SW2 | | | - | ___________ |_____________| | | - | | | 1 2 3 4 5 6 ___| | - | > 6116 RAM | _________ 8 | | | - | |___________| |20MHzXtal| 7 | | | - | |_________| __________ 6 | S | | - | 74LS373 | |- 5 | W | | - | _________ | E |- 4 | | | - | >_______| ______________|..... P |- 3 | 3 | | - | | | : O |- 2 | | | - | | | : X |- 1 |___| | - | ________________ | | : Y |- | | - | | SW1 | | SL90C65 | : |- | | - | |________________| | | : B |- | | - | 1 2 3 4 5 6 7 8 | | : O |- | | - | |_________o____|..../ A |- _______| | - | ____________________ | R |- | |------, - | | | | D |- | BNC | # | - | > 2764 PROM SOCKET | |__________|- |_______|------' - | |____________________| _________ | | - | >________| <- 74LS245 | | - | | | - |___ ______________| | - |H H H H H H H H H H H H H H H H H H H H H H H| | | - |U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U| | | - \| -Legend: - -SL90C65 ARCNET Controller / Transceiver /Logic -SW1 1-5: IRQ Select - 6: ET1 - 7: ET2 - 8: ROM ENABLE -SW2 1-3: Memory Buffer/PROM Address - 3-6: I/O Address Map -SW3 1-8: Node ID Select -BNC BNC RG62/U Connection - *I* have had success using RG59B/U with *NO* terminators! - What gives?! - -SW1: Timeouts, Interrupt and ROM ---------------------------------- - -To select a hardware interrupt level set one (only one!) of the dip switches -up (on) SW1...(switches 1-5) -IRQ3, IRQ4, IRQ5, IRQ7, IRQ2. The Manufacturer's default is IRQ2. - -The switches on SW1 labeled EXT1 (switch 6) and EXT2 (switch 7) -are used to determine the timeout parameters. These two dip switches -are normally left off (down). - - To enable the 8K Boot PROM position SW1 switch 8 on (UP) labeled ROM. - The default is jumper ROM not installed. - - -Setting the I/O Base Address ----------------------------- - -The last three switches in switch group SW2 are used to select one -of eight possible I/O Base addresses using the following table - - - Switch | Hex I/O - 4 5 6 | Address - -------|-------- - 0 0 0 | 260 - 0 0 1 | 290 - 0 1 0 | 2E0 (Manufacturer's default) - 0 1 1 | 2F0 - 1 0 0 | 300 - 1 0 1 | 350 - 1 1 0 | 380 - 1 1 1 | 3E0 - - -Setting the Base Memory Address (RAM & ROM) -------------------------------------------- - -The memory buffer requires 2K of a 16K block of RAM. The base of this -16K block can be located in any of eight positions. -Switches 1-3 of switch group SW2 select the Base of the 16K block. -(0 = DOWN, 1 = UP) -I could, however, only verify two settings... - - Switch| Hex RAM | Hex ROM - 1 2 3 | Address | Address - ------|---------|----------- - 0 0 0 | E0000 | E2000 - 0 0 1 | D0000 | D2000 (Manufacturer's default) - 0 1 0 | ????? | ????? - 0 1 1 | ????? | ????? - 1 0 0 | ????? | ????? - 1 0 1 | ????? | ????? - 1 1 0 | ????? | ????? - 1 1 1 | ????? | ????? - - -Setting the Node ID -------------------- - -The eight switches in group SW3 are used to set the node ID. -Each node attached to the network must have an unique node ID which -must be different from 0. -Switch 1 serves as the least significant bit (LSB). -switches in the DOWN position are OFF (0) and in the UP position are ON (1) - -The node ID is the sum of the values of all switches set to "1" -These values are: - Switch | Value - -------|------- - 1 | 1 - 2 | 2 - 3 | 4 - 4 | 8 - 5 | 16 - 6 | 32 - 7 | 64 - 8 | 128 - -Some Examples: - - Switch# | Hex | Decimal -8 7 6 5 4 3 2 1 | Node ID | Node ID -----------------|---------|--------- -0 0 0 0 0 0 0 0 | not allowed <-. -0 0 0 0 0 0 0 1 | 1 | 1 | -0 0 0 0 0 0 1 0 | 2 | 2 | -0 0 0 0 0 0 1 1 | 3 | 3 | - . . . | | | -0 1 0 1 0 1 0 1 | 55 | 85 | - . . . | | + Don't use 0 or 255! -1 0 1 0 1 0 1 0 | AA | 170 | - . . . | | | -1 1 1 1 1 1 0 1 | FD | 253 | -1 1 1 1 1 1 1 0 | FE | 254 | -1 1 1 1 1 1 1 1 | FF | 255 <-' - - -***************************************************************************** - -** Tiara ** -(model unknown) -------------------------- - - from Christoph Lameter - - -Here is information about my card as far as I could figure it out: ------------------------------------------------ tiara -Tiara LanCard of Tiara Computer Systems. - -+----------------------------------------------+ -! ! Transmitter Unit ! ! -! +------------------+ ------- -! MEM Coax Connector -! ROM 7654321 <- I/O ------- -! : : +--------+ ! -! : : ! 90C66LJ! +++ -! : : ! ! !D Switch to set -! : : ! ! !I the Nodenumber -! : : +--------+ !P -! !++ -! 234567 <- IRQ ! -+------------!!!!!!!!!!!!!!!!!!!!!!!!--------+ - !!!!!!!!!!!!!!!!!!!!!!!! - -0 = Jumper Installed -1 = Open - -Top Jumper line Bit 7 = ROM Enable 654=Memory location 321=I/O - -Settings for Memory Location (Top Jumper Line) -456 Address selected -000 C0000 -001 C4000 -010 CC000 -011 D0000 -100 D4000 -101 D8000 -110 DC000 -111 E0000 - -Settings for I/O Address (Top Jumper Line) -123 Port -000 260 -001 290 -010 2E0 -011 2F0 -100 300 -101 350 -110 380 -111 3E0 - -Settings for IRQ Selection (Lower Jumper Line) -234567 -011111 IRQ 2 -101111 IRQ 3 -110111 IRQ 4 -111011 IRQ 5 -111110 IRQ 7 - -***************************************************************************** - - -Other Cards ------------ - -I have no information on other models of ARCnet cards at the moment. Please -send any and all info to: - apenwarr@worldvisions.ca - -Thanks. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 96ffad845fd9..5da18e024fcb 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -39,6 +39,7 @@ Contents: 6lowpan 6pack altera_tse + arcnet-hardware .. only:: subproject and html -- cgit v1.2.3 From 08bab46f00d0f0fe9709a05b7cdfe909a4258b01 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:20 +0200 Subject: docs: networking: convert arcnet.txt to ReST - add SPDX header; - use document title markup; - add notes markups; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/arcnet.rst | 594 ++++++++++++++++++++++++++++++++++++ Documentation/networking/arcnet.txt | 556 --------------------------------- Documentation/networking/index.rst | 1 + drivers/net/arcnet/Kconfig | 6 +- 4 files changed, 598 insertions(+), 559 deletions(-) create mode 100644 Documentation/networking/arcnet.rst delete mode 100644 Documentation/networking/arcnet.txt (limited to 'Documentation') diff --git a/Documentation/networking/arcnet.rst b/Documentation/networking/arcnet.rst new file mode 100644 index 000000000000..e93d9820f0f1 --- /dev/null +++ b/Documentation/networking/arcnet.rst @@ -0,0 +1,594 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====== +ARCnet +====== + +.. note:: + + See also arcnet-hardware.txt in this directory for jumper-setting + and cabling information if you're like many of us and didn't happen to get a + manual with your ARCnet card. + +Since no one seems to listen to me otherwise, perhaps a poem will get your +attention:: + + This driver's getting fat and beefy, + But my cat is still named Fifi. + +Hmm, I think I'm allowed to call that a poem, even though it's only two +lines. Hey, I'm in Computer Science, not English. Give me a break. + +The point is: I REALLY REALLY REALLY REALLY REALLY want to hear from you if +you test this and get it working. Or if you don't. Or anything. + +ARCnet 0.32 ALPHA first made it into the Linux kernel 1.1.80 - this was +nice, but after that even FEWER people started writing to me because they +didn't even have to install the patch. + +Come on, be a sport! Send me a success report! + +(hey, that was even better than my original poem... this is getting bad!) + + +.. warning:: + + If you don't e-mail me about your success/failure soon, I may be forced to + start SINGING. And we don't want that, do we? + + (You know, it might be argued that I'm pushing this point a little too much. + If you think so, why not flame me in a quick little e-mail? Please also + include the type of card(s) you're using, software, size of network, and + whether it's working or not.) + + My e-mail address is: apenwarr@worldvisions.ca + +These are the ARCnet drivers for Linux. + +This new release (2.91) has been put together by David Woodhouse +, in an attempt to tidy up the driver after adding support +for yet another chipset. Now the generic support has been separated from the +individual chipset drivers, and the source files aren't quite so packed with +#ifdefs! I've changed this file a bit, but kept it in the first person from +Avery, because I didn't want to completely rewrite it. + +The previous release resulted from many months of on-and-off effort from me +(Avery Pennarun), many bug reports/fixes and suggestions from others, and in +particular a lot of input and coding from Tomasz Motylewski. Starting with +ARCnet 2.10 ALPHA, Tomasz's all-new-and-improved RFC1051 support has been +included and seems to be working fine! + + +Where do I discuss these drivers? +--------------------------------- + +Tomasz has been so kind as to set up a new and improved mailing list. +Subscribe by sending a message with the BODY "subscribe linux-arcnet YOUR +REAL NAME" to listserv@tichy.ch.uj.edu.pl. Then, to submit messages to the +list, mail to linux-arcnet@tichy.ch.uj.edu.pl. + +There are archives of the mailing list at: + + http://epistolary.org/mailman/listinfo.cgi/arcnet + +The people on linux-net@vger.kernel.org (now defunct, replaced by +netdev@vger.kernel.org) have also been known to be very helpful, especially +when we're talking about ALPHA Linux kernels that may or may not work right +in the first place. + + +Other Drivers and Info +---------------------- + +You can try my ARCNET page on the World Wide Web at: + + http://www.qis.net/~jschmitz/arcnet/ + +Also, SMC (one of the companies that makes ARCnet cards) has a WWW site you +might be interested in, which includes several drivers for various cards +including ARCnet. Try: + + http://www.smc.com/ + +Performance Technologies makes various network software that supports +ARCnet: + + http://www.perftech.com/ or ftp to ftp.perftech.com. + +Novell makes a networking stack for DOS which includes ARCnet drivers. Try +FTPing to ftp.novell.com. + +You can get the Crynwr packet driver collection (including arcether.com, the +one you'll want to use with ARCnet cards) from +oak.oakland.edu:/simtel/msdos/pktdrvr. It won't work perfectly on a 386+ +without patches, though, and also doesn't like several cards. Fixed +versions are available on my WWW page, or via e-mail if you don't have WWW +access. + + +Installing the Driver +--------------------- + +All you will need to do in order to install the driver is:: + + make config + (be sure to choose ARCnet in the network devices + and at least one chipset driver.) + make clean + make zImage + +If you obtained this ARCnet package as an upgrade to the ARCnet driver in +your current kernel, you will need to first copy arcnet.c over the one in +the linux/drivers/net directory. + +You will know the driver is installed properly if you get some ARCnet +messages when you reboot into the new Linux kernel. + +There are four chipset options: + + 1. Standard ARCnet COM90xx chipset. + +This is the normal ARCnet card, which you've probably got. This is the only +chipset driver which will autoprobe if not told where the card is. +It following options on the command line:: + + com90xx=[[,[,]]][,] | + +If you load the chipset support as a module, the options are:: + + io= irq= shmem= device= + +To disable the autoprobe, just specify "com90xx=" on the kernel command line. +To specify the name alone, but allow autoprobe, just put "com90xx=" + + 2. ARCnet COM20020 chipset. + +This is the new chipset from SMC with support for promiscuous mode (packet +sniffing), extra diagnostic information, etc. Unfortunately, there is no +sensible method of autoprobing for these cards. You must specify the I/O +address on the kernel command line. + +The command line options are:: + + com20020=[,[,[,backplane[,CKP[,timeout]]]]][,name] + +If you load the chipset support as a module, the options are:: + + io= irq= node= backplane= clock= + timeout= device= + +The COM20020 chipset allows you to set the node ID in software, overriding the +default which is still set in DIP switches on the card. If you don't have the +COM20020 data sheets, and you don't know what the other three options refer +to, then they won't interest you - forget them. + + 3. ARCnet COM90xx chipset in IO-mapped mode. + +This will also work with the normal ARCnet cards, but doesn't use the shared +memory. It performs less well than the above driver, but is provided in case +you have a card which doesn't support shared memory, or (strangely) in case +you have so many ARCnet cards in your machine that you run out of shmem slots. +If you don't give the IO address on the kernel command line, then the driver +will not find the card. + +The command line options are:: + + com90io=[,][,] + +If you load the chipset support as a module, the options are: + io= irq= device= + + 4. ARCnet RIM I cards. + +These are COM90xx chips which are _completely_ memory mapped. The support for +these is not tested. If you have one, please mail the author with a success +report. All options must be specified, except the device name. +Command line options:: + + arcrimi=,,[,] + +If you load the chipset support as a module, the options are:: + + shmem= irq= node= device= + + +Loadable Module Support +----------------------- + +Configure and rebuild Linux. When asked, answer 'm' to "Generic ARCnet +support" and to support for your ARCnet chipset if you want to use the +loadable module. You can also say 'y' to "Generic ARCnet support" and 'm' +to the chipset support if you wish. + +:: + + make config + make clean + make zImage + make modules + +If you're using a loadable module, you need to use insmod to load it, and +you can specify various characteristics of your card on the command +line. (In recent versions of the driver, autoprobing is much more reliable +and works as a module, so most of this is now unnecessary.) + +For example:: + + cd /usr/src/linux/modules + insmod arcnet.o + insmod com90xx.o + insmod com20020.o io=0x2e0 device=eth1 + + +Using the Driver +---------------- + +If you build your kernel with ARCnet COM90xx support included, it should +probe for your card automatically when you boot. If you use a different +chipset driver complied into the kernel, you must give the necessary options +on the kernel command line, as detailed above. + +Go read the NET-2-HOWTO and ETHERNET-HOWTO for Linux; they should be +available where you picked up this driver. Think of your ARCnet as a +souped-up (or down, as the case may be) Ethernet card. + +By the way, be sure to change all references from "eth0" to "arc0" in the +HOWTOs. Remember that ARCnet isn't a "true" Ethernet, and the device name +is DIFFERENT. + + +Multiple Cards in One Computer +------------------------------ + +Linux has pretty good support for this now, but since I've been busy, the +ARCnet driver has somewhat suffered in this respect. COM90xx support, if +compiled into the kernel, will (try to) autodetect all the installed cards. + +If you have other cards, with support compiled into the kernel, then you can +just repeat the options on the kernel command line, e.g.:: + + LILO: linux com20020=0x2e0 com20020=0x380 com90io=0x260 + +If you have the chipset support built as a loadable module, then you need to +do something like this:: + + insmod -o arc0 com90xx + insmod -o arc1 com20020 io=0x2e0 + insmod -o arc2 com90xx + +The ARCnet drivers will now sort out their names automatically. + + +How do I get it to work with...? +-------------------------------- + +NFS: + Should be fine linux->linux, just pretend you're using Ethernet cards. + oak.oakland.edu:/simtel/msdos/nfs has some nice DOS clients. There + is also a DOS-based NFS server called SOSS. It doesn't multitask + quite the way Linux does (actually, it doesn't multitask AT ALL) but + you never know what you might need. + + With AmiTCP (and possibly others), you may need to set the following + options in your Amiga nfstab: MD 1024 MR 1024 MW 1024 + (Thanks to Christian Gottschling + for this.) + + Probably these refer to maximum NFS data/read/write block sizes. I + don't know why the defaults on the Amiga didn't work; write to me if + you know more. + +DOS: + If you're using the freeware arcether.com, you might want to install + the driver patch from my web page. It helps with PC/TCP, and also + can get arcether to load if it timed out too quickly during + initialization. In fact, if you use it on a 386+ you REALLY need + the patch, really. + +Windows: + See DOS :) Trumpet Winsock works fine with either the Novell or + Arcether client, assuming you remember to load winpkt of course. + +LAN Manager and Windows for Workgroups: + These programs use protocols that + are incompatible with the Internet standard. They try to pretend + the cards are Ethernet, and confuse everyone else on the network. + + However, v2.00 and higher of the Linux ARCnet driver supports this + protocol via the 'arc0e' device. See the section on "Multiprotocol + Support" for more information. + + Using the freeware Samba server and clients for Linux, you can now + interface quite nicely with TCP/IP-based WfWg or Lan Manager + networks. + +Windows 95: + Tools are included with Win95 that let you use either the LANMAN + style network drivers (NDIS) or Novell drivers (ODI) to handle your + ARCnet packets. If you use ODI, you'll need to use the 'arc0' + device with Linux. If you use NDIS, then try the 'arc0e' device. + See the "Multiprotocol Support" section below if you need arc0e, + you're completely insane, and/or you need to build some kind of + hybrid network that uses both encapsulation types. + +OS/2: + I've been told it works under Warp Connect with an ARCnet driver from + SMC. You need to use the 'arc0e' interface for this. If you get + the SMC driver to work with the TCP/IP stuff included in the + "normal" Warp Bonus Pack, let me know. + + ftp.microsoft.com also has a freeware "Lan Manager for OS/2" client + which should use the same protocol as WfWg does. I had no luck + installing it under Warp, however. Please mail me with any results. + +NetBSD/AmiTCP: + These use an old version of the Internet standard ARCnet + protocol (RFC1051) which is compatible with the Linux driver v2.10 + ALPHA and above using the arc0s device. (See "Multiprotocol ARCnet" + below.) ** Newer versions of NetBSD apparently support RFC1201. + + +Using Multiprotocol ARCnet +-------------------------- + +The ARCnet driver v2.10 ALPHA supports three protocols, each on its own +"virtual network device": + + ====== =============================================================== + arc0 RFC1201 protocol, the official Internet standard which just + happens to be 100% compatible with Novell's TRXNET driver. + Version 1.00 of the ARCnet driver supported _only_ this + protocol. arc0 is the fastest of the three protocols (for + whatever reason), and allows larger packets to be used + because it supports RFC1201 "packet splitting" operations. + Unless you have a specific need to use a different protocol, + I strongly suggest that you stick with this one. + + arc0e "Ethernet-Encapsulation" which sends packets over ARCnet + that are actually a lot like Ethernet packets, including the + 6-byte hardware addresses. This protocol is compatible with + Microsoft's NDIS ARCnet driver, like the one in WfWg and + LANMAN. Because the MTU of 493 is actually smaller than the + one "required" by TCP/IP (576), there is a chance that some + network operations will not function properly. The Linux + TCP/IP layer can compensate in most cases, however, by + automatically fragmenting the TCP/IP packets to make them + fit. arc0e also works slightly more slowly than arc0, for + reasons yet to be determined. (Probably it's the smaller + MTU that does it.) + + arc0s The "[s]imple" RFC1051 protocol is the "previous" Internet + standard that is completely incompatible with the new + standard. Some software today, however, continues to + support the old standard (and only the old standard) + including NetBSD and AmiTCP. RFC1051 also does not support + RFC1201's packet splitting, and the MTU of 507 is still + smaller than the Internet "requirement," so it's quite + possible that you may run into problems. It's also slower + than RFC1201 by about 25%, for the same reason as arc0e. + + The arc0s support was contributed by Tomasz Motylewski + and modified somewhat by me. Bugs are probably my fault. + ====== =============================================================== + +You can choose not to compile arc0e and arc0s into the driver if you want - +this will save you a bit of memory and avoid confusion when eg. trying to +use the "NFS-root" stuff in recent Linux kernels. + +The arc0e and arc0s devices are created automatically when you first +ifconfig the arc0 device. To actually use them, though, you need to also +ifconfig the other virtual devices you need. There are a number of ways you +can set up your network then: + + +1. Single Protocol. + + This is the simplest way to configure your network: use just one of the + two available protocols. As mentioned above, it's a good idea to use + only arc0 unless you have a good reason (like some other software, ie. + WfWg, that only works with arc0e). + + If you need only arc0, then the following commands should get you going:: + + ifconfig arc0 MY.IP.ADD.RESS + route add MY.IP.ADD.RESS arc0 + route add -net SUB.NET.ADD.RESS arc0 + [add other local routes here] + + If you need arc0e (and only arc0e), it's a little different:: + + ifconfig arc0 MY.IP.ADD.RESS + ifconfig arc0e MY.IP.ADD.RESS + route add MY.IP.ADD.RESS arc0e + route add -net SUB.NET.ADD.RESS arc0e + + arc0s works much the same way as arc0e. + + +2. More than one protocol on the same wire. + + Now things start getting confusing. To even try it, you may need to be + partly crazy. Here's what *I* did. :) Note that I don't include arc0s in + my home network; I don't have any NetBSD or AmiTCP computers, so I only + use arc0s during limited testing. + + I have three computers on my home network; two Linux boxes (which prefer + RFC1201 protocol, for reasons listed above), and one XT that can't run + Linux but runs the free Microsoft LANMAN Client instead. + + Worse, one of the Linux computers (freedom) also has a modem and acts as + a router to my Internet provider. The other Linux box (insight) also has + its own IP address and needs to use freedom as its default gateway. The + XT (patience), however, does not have its own Internet IP address and so + I assigned it one on a "private subnet" (as defined by RFC1597). + + To start with, take a simple network with just insight and freedom. + Insight needs to: + + - talk to freedom via RFC1201 (arc0) protocol, because I like it + more and it's faster. + - use freedom as its Internet gateway. + + That's pretty easy to do. Set up insight like this:: + + ifconfig arc0 insight + route add insight arc0 + route add freedom arc0 /* I would use the subnet here (like I said + to to in "single protocol" above), + but the rest of the subnet + unfortunately lies across the PPP + link on freedom, which confuses + things. */ + route add default gw freedom + + And freedom gets configured like so:: + + ifconfig arc0 freedom + route add freedom arc0 + route add insight arc0 + /* and default gateway is configured by pppd */ + + Great, now insight talks to freedom directly on arc0, and sends packets + to the Internet through freedom. If you didn't know how to do the above, + you should probably stop reading this section now because it only gets + worse. + + Now, how do I add patience into the network? It will be using LANMAN + Client, which means I need the arc0e device. It needs to be able to talk + to both insight and freedom, and also use freedom as a gateway to the + Internet. (Recall that patience has a "private IP address" which won't + work on the Internet; that's okay, I configured Linux IP masquerading on + freedom for this subnet). + + So patience (necessarily; I don't have another IP number from my + provider) has an IP address on a different subnet than freedom and + insight, but needs to use freedom as an Internet gateway. Worse, most + DOS networking programs, including LANMAN, have braindead networking + schemes that rely completely on the netmask and a 'default gateway' to + determine how to route packets. This means that to get to freedom or + insight, patience WILL send through its default gateway, regardless of + the fact that both freedom and insight (courtesy of the arc0e device) + could understand a direct transmission. + + I compensate by giving freedom an extra IP address - aliased 'gatekeeper' - + that is on my private subnet, the same subnet that patience is on. I + then define gatekeeper to be the default gateway for patience. + + To configure freedom (in addition to the commands above):: + + ifconfig arc0e gatekeeper + route add gatekeeper arc0e + route add patience arc0e + + This way, freedom will send all packets for patience through arc0e, + giving its IP address as gatekeeper (on the private subnet). When it + talks to insight or the Internet, it will use its "freedom" Internet IP + address. + + You will notice that we haven't configured the arc0e device on insight. + This would work, but is not really necessary, and would require me to + assign insight another special IP number from my private subnet. Since + both insight and patience are using freedom as their default gateway, the + two can already talk to each other. + + It's quite fortunate that I set things up like this the first time (cough + cough) because it's really handy when I boot insight into DOS. There, it + runs the Novell ODI protocol stack, which only works with RFC1201 ARCnet. + In this mode it would be impossible for insight to communicate directly + with patience, since the Novell stack is incompatible with Microsoft's + Ethernet-Encap. Without changing any settings on freedom or patience, I + simply set freedom as the default gateway for insight (now in DOS, + remember) and all the forwarding happens "automagically" between the two + hosts that would normally not be able to communicate at all. + + For those who like diagrams, I have created two "virtual subnets" on the + same physical ARCnet wire. You can picture it like this:: + + + [RFC1201 NETWORK] [ETHER-ENCAP NETWORK] + (registered Internet subnet) (RFC1597 private subnet) + + (IP Masquerade) + /---------------\ * /---------------\ + | | * | | + | +-Freedom-*-Gatekeeper-+ | + | | | * | | + \-------+-------/ | * \-------+-------/ + | | | + Insight | Patience + (Internet) + + + +It works: what now? +------------------- + +Send mail describing your setup, preferably including driver version, kernel +version, ARCnet card model, CPU type, number of systems on your network, and +list of software in use to me at the following address: + + apenwarr@worldvisions.ca + +I do send (sometimes automated) replies to all messages I receive. My email +can be weird (and also usually gets forwarded all over the place along the +way to me), so if you don't get a reply within a reasonable time, please +resend. + + +It doesn't work: what now? +-------------------------- + +Do the same as above, but also include the output of the ifconfig and route +commands, as well as any pertinent log entries (ie. anything that starts +with "arcnet:" and has shown up since the last reboot) in your mail. + +If you want to try fixing it yourself (I strongly recommend that you mail me +about the problem first, since it might already have been solved) you may +want to try some of the debug levels available. For heavy testing on +D_DURING or more, it would be a REALLY good idea to kill your klogd daemon +first! D_DURING displays 4-5 lines for each packet sent or received. D_TX, +D_RX, and D_SKB actually DISPLAY each packet as it is sent or received, +which is obviously quite big. + +Starting with v2.40 ALPHA, the autoprobe routines have changed +significantly. In particular, they won't tell you why the card was not +found unless you turn on the D_INIT_REASONS debugging flag. + +Once the driver is running, you can run the arcdump shell script (available +from me or in the full ARCnet package, if you have it) as root to list the +contents of the arcnet buffers at any time. To make any sense at all out of +this, you should grab the pertinent RFCs. (some are listed near the top of +arcnet.c). arcdump assumes your card is at 0xD0000. If it isn't, edit the +script. + +Buffers 0 and 1 are used for receiving, and Buffers 2 and 3 are for sending. +Ping-pong buffers are implemented both ways. + +If your debug level includes D_DURING and you did NOT define SLOW_XMIT_COPY, +the buffers are cleared to a constant value of 0x42 every time the card is +reset (which should only happen when you do an ifconfig up, or when Linux +decides that the driver is broken). During a transmit, unused parts of the +buffer will be cleared to 0x42 as well. This is to make it easier to figure +out which bytes are being used by a packet. + +You can change the debug level without recompiling the kernel by typing:: + + ifconfig arc0 down metric 1xxx + /etc/rc.d/rc.inet1 + +where "xxx" is the debug level you want. For example, "metric 1015" would put +you at debug level 15. Debug level 7 is currently the default. + +Note that the debug level is (starting with v1.90 ALPHA) a binary +combination of different debug flags; so debug level 7 is really 1+2+4 or +D_NORMAL+D_EXTRA+D_INIT. To include D_DURING, you would add 16 to this, +resulting in debug level 23. + +If you don't understand that, you probably don't want to know anyway. +E-mail me about your problem. + + +I want to send money: what now? +------------------------------- + +Go take a nap or something. You'll feel better in the morning. diff --git a/Documentation/networking/arcnet.txt b/Documentation/networking/arcnet.txt deleted file mode 100644 index aff97f47c05c..000000000000 --- a/Documentation/networking/arcnet.txt +++ /dev/null @@ -1,556 +0,0 @@ ----------------------------------------------------------------------------- -NOTE: See also arcnet-hardware.txt in this directory for jumper-setting -and cabling information if you're like many of us and didn't happen to get a -manual with your ARCnet card. ----------------------------------------------------------------------------- - -Since no one seems to listen to me otherwise, perhaps a poem will get your -attention: - This driver's getting fat and beefy, - But my cat is still named Fifi. - -Hmm, I think I'm allowed to call that a poem, even though it's only two -lines. Hey, I'm in Computer Science, not English. Give me a break. - -The point is: I REALLY REALLY REALLY REALLY REALLY want to hear from you if -you test this and get it working. Or if you don't. Or anything. - -ARCnet 0.32 ALPHA first made it into the Linux kernel 1.1.80 - this was -nice, but after that even FEWER people started writing to me because they -didn't even have to install the patch. - -Come on, be a sport! Send me a success report! - -(hey, that was even better than my original poem... this is getting bad!) - - --------- -WARNING: --------- - -If you don't e-mail me about your success/failure soon, I may be forced to -start SINGING. And we don't want that, do we? - -(You know, it might be argued that I'm pushing this point a little too much. -If you think so, why not flame me in a quick little e-mail? Please also -include the type of card(s) you're using, software, size of network, and -whether it's working or not.) - -My e-mail address is: apenwarr@worldvisions.ca - - ---------------------------------------------------------------------------- - - -These are the ARCnet drivers for Linux. - - -This new release (2.91) has been put together by David Woodhouse -, in an attempt to tidy up the driver after adding support -for yet another chipset. Now the generic support has been separated from the -individual chipset drivers, and the source files aren't quite so packed with -#ifdefs! I've changed this file a bit, but kept it in the first person from -Avery, because I didn't want to completely rewrite it. - -The previous release resulted from many months of on-and-off effort from me -(Avery Pennarun), many bug reports/fixes and suggestions from others, and in -particular a lot of input and coding from Tomasz Motylewski. Starting with -ARCnet 2.10 ALPHA, Tomasz's all-new-and-improved RFC1051 support has been -included and seems to be working fine! - - -Where do I discuss these drivers? ---------------------------------- - -Tomasz has been so kind as to set up a new and improved mailing list. -Subscribe by sending a message with the BODY "subscribe linux-arcnet YOUR -REAL NAME" to listserv@tichy.ch.uj.edu.pl. Then, to submit messages to the -list, mail to linux-arcnet@tichy.ch.uj.edu.pl. - -There are archives of the mailing list at: - http://epistolary.org/mailman/listinfo.cgi/arcnet - -The people on linux-net@vger.kernel.org (now defunct, replaced by -netdev@vger.kernel.org) have also been known to be very helpful, especially -when we're talking about ALPHA Linux kernels that may or may not work right -in the first place. - - -Other Drivers and Info ----------------------- - -You can try my ARCNET page on the World Wide Web at: - http://www.qis.net/~jschmitz/arcnet/ - -Also, SMC (one of the companies that makes ARCnet cards) has a WWW site you -might be interested in, which includes several drivers for various cards -including ARCnet. Try: - http://www.smc.com/ - -Performance Technologies makes various network software that supports -ARCnet: - http://www.perftech.com/ or ftp to ftp.perftech.com. - -Novell makes a networking stack for DOS which includes ARCnet drivers. Try -FTPing to ftp.novell.com. - -You can get the Crynwr packet driver collection (including arcether.com, the -one you'll want to use with ARCnet cards) from -oak.oakland.edu:/simtel/msdos/pktdrvr. It won't work perfectly on a 386+ -without patches, though, and also doesn't like several cards. Fixed -versions are available on my WWW page, or via e-mail if you don't have WWW -access. - - -Installing the Driver ---------------------- - -All you will need to do in order to install the driver is: - make config - (be sure to choose ARCnet in the network devices - and at least one chipset driver.) - make clean - make zImage - -If you obtained this ARCnet package as an upgrade to the ARCnet driver in -your current kernel, you will need to first copy arcnet.c over the one in -the linux/drivers/net directory. - -You will know the driver is installed properly if you get some ARCnet -messages when you reboot into the new Linux kernel. - -There are four chipset options: - - 1. Standard ARCnet COM90xx chipset. - -This is the normal ARCnet card, which you've probably got. This is the only -chipset driver which will autoprobe if not told where the card is. -It following options on the command line: - com90xx=[[,[,]]][,] | - -If you load the chipset support as a module, the options are: - io= irq= shmem= device= - -To disable the autoprobe, just specify "com90xx=" on the kernel command line. -To specify the name alone, but allow autoprobe, just put "com90xx=" - - 2. ARCnet COM20020 chipset. - -This is the new chipset from SMC with support for promiscuous mode (packet -sniffing), extra diagnostic information, etc. Unfortunately, there is no -sensible method of autoprobing for these cards. You must specify the I/O -address on the kernel command line. -The command line options are: - com20020=[,[,[,backplane[,CKP[,timeout]]]]][,name] - -If you load the chipset support as a module, the options are: - io= irq= node= backplane= clock= - timeout= device= - -The COM20020 chipset allows you to set the node ID in software, overriding the -default which is still set in DIP switches on the card. If you don't have the -COM20020 data sheets, and you don't know what the other three options refer -to, then they won't interest you - forget them. - - 3. ARCnet COM90xx chipset in IO-mapped mode. - -This will also work with the normal ARCnet cards, but doesn't use the shared -memory. It performs less well than the above driver, but is provided in case -you have a card which doesn't support shared memory, or (strangely) in case -you have so many ARCnet cards in your machine that you run out of shmem slots. -If you don't give the IO address on the kernel command line, then the driver -will not find the card. -The command line options are: - com90io=[,][,] - -If you load the chipset support as a module, the options are: - io= irq= device= - - 4. ARCnet RIM I cards. - -These are COM90xx chips which are _completely_ memory mapped. The support for -these is not tested. If you have one, please mail the author with a success -report. All options must be specified, except the device name. -Command line options: - arcrimi=,,[,] - -If you load the chipset support as a module, the options are: - shmem= irq= node= device= - - -Loadable Module Support ------------------------ - -Configure and rebuild Linux. When asked, answer 'm' to "Generic ARCnet -support" and to support for your ARCnet chipset if you want to use the -loadable module. You can also say 'y' to "Generic ARCnet support" and 'm' -to the chipset support if you wish. - - make config - make clean - make zImage - make modules - -If you're using a loadable module, you need to use insmod to load it, and -you can specify various characteristics of your card on the command -line. (In recent versions of the driver, autoprobing is much more reliable -and works as a module, so most of this is now unnecessary.) - -For example: - cd /usr/src/linux/modules - insmod arcnet.o - insmod com90xx.o - insmod com20020.o io=0x2e0 device=eth1 - - -Using the Driver ----------------- - -If you build your kernel with ARCnet COM90xx support included, it should -probe for your card automatically when you boot. If you use a different -chipset driver complied into the kernel, you must give the necessary options -on the kernel command line, as detailed above. - -Go read the NET-2-HOWTO and ETHERNET-HOWTO for Linux; they should be -available where you picked up this driver. Think of your ARCnet as a -souped-up (or down, as the case may be) Ethernet card. - -By the way, be sure to change all references from "eth0" to "arc0" in the -HOWTOs. Remember that ARCnet isn't a "true" Ethernet, and the device name -is DIFFERENT. - - -Multiple Cards in One Computer ------------------------------- - -Linux has pretty good support for this now, but since I've been busy, the -ARCnet driver has somewhat suffered in this respect. COM90xx support, if -compiled into the kernel, will (try to) autodetect all the installed cards. - -If you have other cards, with support compiled into the kernel, then you can -just repeat the options on the kernel command line, e.g.: -LILO: linux com20020=0x2e0 com20020=0x380 com90io=0x260 - -If you have the chipset support built as a loadable module, then you need to -do something like this: - insmod -o arc0 com90xx - insmod -o arc1 com20020 io=0x2e0 - insmod -o arc2 com90xx -The ARCnet drivers will now sort out their names automatically. - - -How do I get it to work with...? --------------------------------- - -NFS: Should be fine linux->linux, just pretend you're using Ethernet cards. - oak.oakland.edu:/simtel/msdos/nfs has some nice DOS clients. There - is also a DOS-based NFS server called SOSS. It doesn't multitask - quite the way Linux does (actually, it doesn't multitask AT ALL) but - you never know what you might need. - - With AmiTCP (and possibly others), you may need to set the following - options in your Amiga nfstab: MD 1024 MR 1024 MW 1024 - (Thanks to Christian Gottschling - for this.) - - Probably these refer to maximum NFS data/read/write block sizes. I - don't know why the defaults on the Amiga didn't work; write to me if - you know more. - -DOS: If you're using the freeware arcether.com, you might want to install - the driver patch from my web page. It helps with PC/TCP, and also - can get arcether to load if it timed out too quickly during - initialization. In fact, if you use it on a 386+ you REALLY need - the patch, really. - -Windows: See DOS :) Trumpet Winsock works fine with either the Novell or - Arcether client, assuming you remember to load winpkt of course. - -LAN Manager and Windows for Workgroups: These programs use protocols that - are incompatible with the Internet standard. They try to pretend - the cards are Ethernet, and confuse everyone else on the network. - - However, v2.00 and higher of the Linux ARCnet driver supports this - protocol via the 'arc0e' device. See the section on "Multiprotocol - Support" for more information. - - Using the freeware Samba server and clients for Linux, you can now - interface quite nicely with TCP/IP-based WfWg or Lan Manager - networks. - -Windows 95: Tools are included with Win95 that let you use either the LANMAN - style network drivers (NDIS) or Novell drivers (ODI) to handle your - ARCnet packets. If you use ODI, you'll need to use the 'arc0' - device with Linux. If you use NDIS, then try the 'arc0e' device. - See the "Multiprotocol Support" section below if you need arc0e, - you're completely insane, and/or you need to build some kind of - hybrid network that uses both encapsulation types. - -OS/2: I've been told it works under Warp Connect with an ARCnet driver from - SMC. You need to use the 'arc0e' interface for this. If you get - the SMC driver to work with the TCP/IP stuff included in the - "normal" Warp Bonus Pack, let me know. - - ftp.microsoft.com also has a freeware "Lan Manager for OS/2" client - which should use the same protocol as WfWg does. I had no luck - installing it under Warp, however. Please mail me with any results. - -NetBSD/AmiTCP: These use an old version of the Internet standard ARCnet - protocol (RFC1051) which is compatible with the Linux driver v2.10 - ALPHA and above using the arc0s device. (See "Multiprotocol ARCnet" - below.) ** Newer versions of NetBSD apparently support RFC1201. - - -Using Multiprotocol ARCnet --------------------------- - -The ARCnet driver v2.10 ALPHA supports three protocols, each on its own -"virtual network device": - - arc0 - RFC1201 protocol, the official Internet standard which just - happens to be 100% compatible with Novell's TRXNET driver. - Version 1.00 of the ARCnet driver supported _only_ this - protocol. arc0 is the fastest of the three protocols (for - whatever reason), and allows larger packets to be used - because it supports RFC1201 "packet splitting" operations. - Unless you have a specific need to use a different protocol, - I strongly suggest that you stick with this one. - - arc0e - "Ethernet-Encapsulation" which sends packets over ARCnet - that are actually a lot like Ethernet packets, including the - 6-byte hardware addresses. This protocol is compatible with - Microsoft's NDIS ARCnet driver, like the one in WfWg and - LANMAN. Because the MTU of 493 is actually smaller than the - one "required" by TCP/IP (576), there is a chance that some - network operations will not function properly. The Linux - TCP/IP layer can compensate in most cases, however, by - automatically fragmenting the TCP/IP packets to make them - fit. arc0e also works slightly more slowly than arc0, for - reasons yet to be determined. (Probably it's the smaller - MTU that does it.) - - arc0s - The "[s]imple" RFC1051 protocol is the "previous" Internet - standard that is completely incompatible with the new - standard. Some software today, however, continues to - support the old standard (and only the old standard) - including NetBSD and AmiTCP. RFC1051 also does not support - RFC1201's packet splitting, and the MTU of 507 is still - smaller than the Internet "requirement," so it's quite - possible that you may run into problems. It's also slower - than RFC1201 by about 25%, for the same reason as arc0e. - - The arc0s support was contributed by Tomasz Motylewski - and modified somewhat by me. Bugs are probably my fault. - -You can choose not to compile arc0e and arc0s into the driver if you want - -this will save you a bit of memory and avoid confusion when eg. trying to -use the "NFS-root" stuff in recent Linux kernels. - -The arc0e and arc0s devices are created automatically when you first -ifconfig the arc0 device. To actually use them, though, you need to also -ifconfig the other virtual devices you need. There are a number of ways you -can set up your network then: - - -1. Single Protocol. - - This is the simplest way to configure your network: use just one of the - two available protocols. As mentioned above, it's a good idea to use - only arc0 unless you have a good reason (like some other software, ie. - WfWg, that only works with arc0e). - - If you need only arc0, then the following commands should get you going: - ifconfig arc0 MY.IP.ADD.RESS - route add MY.IP.ADD.RESS arc0 - route add -net SUB.NET.ADD.RESS arc0 - [add other local routes here] - - If you need arc0e (and only arc0e), it's a little different: - ifconfig arc0 MY.IP.ADD.RESS - ifconfig arc0e MY.IP.ADD.RESS - route add MY.IP.ADD.RESS arc0e - route add -net SUB.NET.ADD.RESS arc0e - - arc0s works much the same way as arc0e. - - -2. More than one protocol on the same wire. - - Now things start getting confusing. To even try it, you may need to be - partly crazy. Here's what *I* did. :) Note that I don't include arc0s in - my home network; I don't have any NetBSD or AmiTCP computers, so I only - use arc0s during limited testing. - - I have three computers on my home network; two Linux boxes (which prefer - RFC1201 protocol, for reasons listed above), and one XT that can't run - Linux but runs the free Microsoft LANMAN Client instead. - - Worse, one of the Linux computers (freedom) also has a modem and acts as - a router to my Internet provider. The other Linux box (insight) also has - its own IP address and needs to use freedom as its default gateway. The - XT (patience), however, does not have its own Internet IP address and so - I assigned it one on a "private subnet" (as defined by RFC1597). - - To start with, take a simple network with just insight and freedom. - Insight needs to: - - talk to freedom via RFC1201 (arc0) protocol, because I like it - more and it's faster. - - use freedom as its Internet gateway. - - That's pretty easy to do. Set up insight like this: - ifconfig arc0 insight - route add insight arc0 - route add freedom arc0 /* I would use the subnet here (like I said - to to in "single protocol" above), - but the rest of the subnet - unfortunately lies across the PPP - link on freedom, which confuses - things. */ - route add default gw freedom - - And freedom gets configured like so: - ifconfig arc0 freedom - route add freedom arc0 - route add insight arc0 - /* and default gateway is configured by pppd */ - - Great, now insight talks to freedom directly on arc0, and sends packets - to the Internet through freedom. If you didn't know how to do the above, - you should probably stop reading this section now because it only gets - worse. - - Now, how do I add patience into the network? It will be using LANMAN - Client, which means I need the arc0e device. It needs to be able to talk - to both insight and freedom, and also use freedom as a gateway to the - Internet. (Recall that patience has a "private IP address" which won't - work on the Internet; that's okay, I configured Linux IP masquerading on - freedom for this subnet). - - So patience (necessarily; I don't have another IP number from my - provider) has an IP address on a different subnet than freedom and - insight, but needs to use freedom as an Internet gateway. Worse, most - DOS networking programs, including LANMAN, have braindead networking - schemes that rely completely on the netmask and a 'default gateway' to - determine how to route packets. This means that to get to freedom or - insight, patience WILL send through its default gateway, regardless of - the fact that both freedom and insight (courtesy of the arc0e device) - could understand a direct transmission. - - I compensate by giving freedom an extra IP address - aliased 'gatekeeper' - - that is on my private subnet, the same subnet that patience is on. I - then define gatekeeper to be the default gateway for patience. - - To configure freedom (in addition to the commands above): - ifconfig arc0e gatekeeper - route add gatekeeper arc0e - route add patience arc0e - - This way, freedom will send all packets for patience through arc0e, - giving its IP address as gatekeeper (on the private subnet). When it - talks to insight or the Internet, it will use its "freedom" Internet IP - address. - - You will notice that we haven't configured the arc0e device on insight. - This would work, but is not really necessary, and would require me to - assign insight another special IP number from my private subnet. Since - both insight and patience are using freedom as their default gateway, the - two can already talk to each other. - - It's quite fortunate that I set things up like this the first time (cough - cough) because it's really handy when I boot insight into DOS. There, it - runs the Novell ODI protocol stack, which only works with RFC1201 ARCnet. - In this mode it would be impossible for insight to communicate directly - with patience, since the Novell stack is incompatible with Microsoft's - Ethernet-Encap. Without changing any settings on freedom or patience, I - simply set freedom as the default gateway for insight (now in DOS, - remember) and all the forwarding happens "automagically" between the two - hosts that would normally not be able to communicate at all. - - For those who like diagrams, I have created two "virtual subnets" on the - same physical ARCnet wire. You can picture it like this: - - - [RFC1201 NETWORK] [ETHER-ENCAP NETWORK] - (registered Internet subnet) (RFC1597 private subnet) - - (IP Masquerade) - /---------------\ * /---------------\ - | | * | | - | +-Freedom-*-Gatekeeper-+ | - | | | * | | - \-------+-------/ | * \-------+-------/ - | | | - Insight | Patience - (Internet) - - - -It works: what now? -------------------- - -Send mail describing your setup, preferably including driver version, kernel -version, ARCnet card model, CPU type, number of systems on your network, and -list of software in use to me at the following address: - apenwarr@worldvisions.ca - -I do send (sometimes automated) replies to all messages I receive. My email -can be weird (and also usually gets forwarded all over the place along the -way to me), so if you don't get a reply within a reasonable time, please -resend. - - -It doesn't work: what now? --------------------------- - -Do the same as above, but also include the output of the ifconfig and route -commands, as well as any pertinent log entries (ie. anything that starts -with "arcnet:" and has shown up since the last reboot) in your mail. - -If you want to try fixing it yourself (I strongly recommend that you mail me -about the problem first, since it might already have been solved) you may -want to try some of the debug levels available. For heavy testing on -D_DURING or more, it would be a REALLY good idea to kill your klogd daemon -first! D_DURING displays 4-5 lines for each packet sent or received. D_TX, -D_RX, and D_SKB actually DISPLAY each packet as it is sent or received, -which is obviously quite big. - -Starting with v2.40 ALPHA, the autoprobe routines have changed -significantly. In particular, they won't tell you why the card was not -found unless you turn on the D_INIT_REASONS debugging flag. - -Once the driver is running, you can run the arcdump shell script (available -from me or in the full ARCnet package, if you have it) as root to list the -contents of the arcnet buffers at any time. To make any sense at all out of -this, you should grab the pertinent RFCs. (some are listed near the top of -arcnet.c). arcdump assumes your card is at 0xD0000. If it isn't, edit the -script. - -Buffers 0 and 1 are used for receiving, and Buffers 2 and 3 are for sending. -Ping-pong buffers are implemented both ways. - -If your debug level includes D_DURING and you did NOT define SLOW_XMIT_COPY, -the buffers are cleared to a constant value of 0x42 every time the card is -reset (which should only happen when you do an ifconfig up, or when Linux -decides that the driver is broken). During a transmit, unused parts of the -buffer will be cleared to 0x42 as well. This is to make it easier to figure -out which bytes are being used by a packet. - -You can change the debug level without recompiling the kernel by typing: - ifconfig arc0 down metric 1xxx - /etc/rc.d/rc.inet1 -where "xxx" is the debug level you want. For example, "metric 1015" would put -you at debug level 15. Debug level 7 is currently the default. - -Note that the debug level is (starting with v1.90 ALPHA) a binary -combination of different debug flags; so debug level 7 is really 1+2+4 or -D_NORMAL+D_EXTRA+D_INIT. To include D_DURING, you would add 16 to this, -resulting in debug level 23. - -If you don't understand that, you probably don't want to know anyway. -E-mail me about your problem. - - -I want to send money: what now? -------------------------------- - -Go take a nap or something. You'll feel better in the morning. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 5da18e024fcb..3e0a4bb23ef9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -40,6 +40,7 @@ Contents: 6pack altera_tse arcnet-hardware + arcnet .. only:: subproject and html diff --git a/drivers/net/arcnet/Kconfig b/drivers/net/arcnet/Kconfig index 27551bf3d7e4..43eef60653b2 100644 --- a/drivers/net/arcnet/Kconfig +++ b/drivers/net/arcnet/Kconfig @@ -9,7 +9,7 @@ menuconfig ARCNET ---help--- If you have a network card of this type, say Y and check out the (arguably) beautiful poetry in - . + . You need both this driver, and the driver for the particular ARCnet chipset of your card. If you don't know, then it's probably a @@ -28,7 +28,7 @@ config ARCNET_1201 arc0 device. You need to say Y here to communicate with industry-standard RFC1201 implementations, like the arcether.com packet driver or most DOS/Windows ODI drivers. Please read the - ARCnet documentation in + ARCnet documentation in for more information about using arc0. config ARCNET_1051 @@ -42,7 +42,7 @@ config ARCNET_1051 industry-standard RFC1201 implementations, like the arcether.com packet driver or most DOS/Windows ODI drivers. RFC1201 is included automatically as the arc0 device. Please read the ARCnet - documentation in for more + documentation in for more information about using arc0e and arc0s. config ARCNET_RAW -- cgit v1.2.3 From ff2269f16a1e1a7f8bbe72920d3d285ba3943572 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:21 +0200 Subject: docs: networking: convert atm.txt to ReST There isn't much to be done here. Just: - add SPDX header; - add a document title. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/atm.rst | 14 ++++++++++++++ Documentation/networking/atm.txt | 8 -------- Documentation/networking/index.rst | 1 + net/atm/Kconfig | 2 +- 4 files changed, 16 insertions(+), 9 deletions(-) create mode 100644 Documentation/networking/atm.rst delete mode 100644 Documentation/networking/atm.txt (limited to 'Documentation') diff --git a/Documentation/networking/atm.rst b/Documentation/networking/atm.rst new file mode 100644 index 000000000000..c1df8c038525 --- /dev/null +++ b/Documentation/networking/atm.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=== +ATM +=== + +In order to use anything but the most primitive functions of ATM, +several user-mode programs are required to assist the kernel. These +programs and related material can be found via the ATM on Linux Web +page at http://linux-atm.sourceforge.net/ + +If you encounter problems with ATM, please report them on the ATM +on Linux mailing list. Subscription information, archives, etc., +can be found on http://linux-atm.sourceforge.net/ diff --git a/Documentation/networking/atm.txt b/Documentation/networking/atm.txt deleted file mode 100644 index 82921cee77fe..000000000000 --- a/Documentation/networking/atm.txt +++ /dev/null @@ -1,8 +0,0 @@ -In order to use anything but the most primitive functions of ATM, -several user-mode programs are required to assist the kernel. These -programs and related material can be found via the ATM on Linux Web -page at http://linux-atm.sourceforge.net/ - -If you encounter problems with ATM, please report them on the ATM -on Linux mailing list. Subscription information, archives, etc., -can be found on http://linux-atm.sourceforge.net/ diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 3e0a4bb23ef9..841f3c3905d5 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -41,6 +41,7 @@ Contents: altera_tse arcnet-hardware arcnet + atm .. only:: subproject and html diff --git a/net/atm/Kconfig b/net/atm/Kconfig index 271f682e8438..e61dcc9f85b2 100644 --- a/net/atm/Kconfig +++ b/net/atm/Kconfig @@ -16,7 +16,7 @@ config ATM of your ATM card below. Note that you need a set of user-space programs to actually make use - of ATM. See the file for + of ATM. See the file for further details. config ATM_CLIP -- cgit v1.2.3 From 20b943f075574c233de51fa2f0124a97f0298be1 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:22 +0200 Subject: docs: networking: convert ax25.txt to ReST There isn't much to be done here. Just: - add SPDX header; - add a document title. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/ax25.rst | 16 ++++++++++++++++ Documentation/networking/ax25.txt | 10 ---------- Documentation/networking/index.rst | 1 + net/ax25/Kconfig | 6 +++--- 4 files changed, 20 insertions(+), 13 deletions(-) create mode 100644 Documentation/networking/ax25.rst delete mode 100644 Documentation/networking/ax25.txt (limited to 'Documentation') diff --git a/Documentation/networking/ax25.rst b/Documentation/networking/ax25.rst new file mode 100644 index 000000000000..824afd7002db --- /dev/null +++ b/Documentation/networking/ax25.rst @@ -0,0 +1,16 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===== +AX.25 +===== + +To use the amateur radio protocols within Linux you will need to get a +suitable copy of the AX.25 Utilities. More detailed information about +AX.25, NET/ROM and ROSE, associated programs and and utilities can be +found on http://www.linux-ax25.org. + +There is an active mailing list for discussing Linux amateur radio matters +called linux-hams@vger.kernel.org. To subscribe to it, send a message to +majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body +of the message, the subject field is ignored. You don't need to be +subscribed to post but of course that means you might miss an answer. diff --git a/Documentation/networking/ax25.txt b/Documentation/networking/ax25.txt deleted file mode 100644 index 8257dbf9be57..000000000000 --- a/Documentation/networking/ax25.txt +++ /dev/null @@ -1,10 +0,0 @@ -To use the amateur radio protocols within Linux you will need to get a -suitable copy of the AX.25 Utilities. More detailed information about -AX.25, NET/ROM and ROSE, associated programs and and utilities can be -found on http://www.linux-ax25.org. - -There is an active mailing list for discussing Linux amateur radio matters -called linux-hams@vger.kernel.org. To subscribe to it, send a message to -majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body -of the message, the subject field is ignored. You don't need to be -subscribed to post but of course that means you might miss an answer. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 841f3c3905d5..6a5858b27cf6 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -42,6 +42,7 @@ Contents: arcnet-hardware arcnet atm + ax25 .. only:: subproject and html diff --git a/net/ax25/Kconfig b/net/ax25/Kconfig index 043fd5437809..97d686d115c0 100644 --- a/net/ax25/Kconfig +++ b/net/ax25/Kconfig @@ -40,7 +40,7 @@ config AX25 radio as well as information about how to configure an AX.25 port is contained in the AX25-HOWTO, available from . You might also want to - check out the file in the + check out the file in the kernel source. More information about digital amateur radio in general is on the WWW at . @@ -88,7 +88,7 @@ config NETROM users as well as information about how to configure an AX.25 port is contained in the Linux Ham Wiki, available from . You also might want to check out the - file . More information about + file . More information about digital amateur radio in general is on the WWW at . @@ -107,7 +107,7 @@ config ROSE users as well as information about how to configure an AX.25 port is contained in the Linux Ham Wiki, available from . You also might want to check out the - file . More information about + file . More information about digital amateur radio in general is on the WWW at . -- cgit v1.2.3 From b5fcf32d7d4b647c0f3aa612d91d25996a49bcd9 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:23 +0200 Subject: docs: networking: convert baycom.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/baycom.rst | 174 ++++++++++++++++++++++++++++++++++++ Documentation/networking/baycom.txt | 158 -------------------------------- Documentation/networking/index.rst | 1 + drivers/net/hamradio/Kconfig | 8 +- 4 files changed, 179 insertions(+), 162 deletions(-) create mode 100644 Documentation/networking/baycom.rst delete mode 100644 Documentation/networking/baycom.txt (limited to 'Documentation') diff --git a/Documentation/networking/baycom.rst b/Documentation/networking/baycom.rst new file mode 100644 index 000000000000..fe2d010f0e86 --- /dev/null +++ b/Documentation/networking/baycom.rst @@ -0,0 +1,174 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +Linux Drivers for Baycom Modems +=============================== + +Thomas M. Sailer, HB9JNX/AE4WA, + +The drivers for the baycom modems have been split into +separate drivers as they did not share any code, and the driver +and device names have changed. + +This document describes the Linux Kernel Drivers for simple Baycom style +amateur radio modems. + +The following drivers are available: +==================================== + +baycom_ser_fdx: + This driver supports the SER12 modems either full or half duplex. + Its baud rate may be changed via the ``baud`` module parameter, + therefore it supports just about every bit bang modem on a + serial port. Its devices are called bcsf0 through bcsf3. + This is the recommended driver for SER12 type modems, + however if you have a broken UART clone that does not have working + delta status bits, you may try baycom_ser_hdx. + +baycom_ser_hdx: + This is an alternative driver for SER12 type modems. + It only supports half duplex, and only 1200 baud. Its devices + are called bcsh0 through bcsh3. Use this driver only if baycom_ser_fdx + does not work with your UART. + +baycom_par: + This driver supports the par96 and picpar modems. + Its devices are called bcp0 through bcp3. + +baycom_epp: + This driver supports the EPP modem. + Its devices are called bce0 through bce3. + This driver is work-in-progress. + +The following modems are supported: + +======= ======================================================================== +ser12 This is a very simple 1200 baud AFSK modem. The modem consists only + of a modulator/demodulator chip, usually a TI TCM3105. The computer + is responsible for regenerating the receiver bit clock, as well as + for handling the HDLC protocol. The modem connects to a serial port, + hence the name. Since the serial port is not used as an async serial + port, the kernel driver for serial ports cannot be used, and this + driver only supports standard serial hardware (8250, 16450, 16550) + +par96 This is a modem for 9600 baud FSK compatible to the G3RUH standard. + The modem does all the filtering and regenerates the receiver clock. + Data is transferred from and to the PC via a shift register. + The shift register is filled with 16 bits and an interrupt is signalled. + The PC then empties the shift register in a burst. This modem connects + to the parallel port, hence the name. The modem leaves the + implementation of the HDLC protocol and the scrambler polynomial to + the PC. + +picpar This is a redesign of the par96 modem by Henning Rech, DF9IC. The modem + is protocol compatible to par96, but uses only three low power ICs + and can therefore be fed from the parallel port and does not require + an additional power supply. Furthermore, it incorporates a carrier + detect circuitry. + +EPP This is a high-speed modem adaptor that connects to an enhanced parallel + port. + + Its target audience is users working over a high speed hub (76.8kbit/s). + +eppfpga This is a redesign of the EPP adaptor. +======= ======================================================================== + +All of the above modems only support half duplex communications. However, +the driver supports the KISS (see below) fullduplex command. It then simply +starts to send as soon as there's a packet to transmit and does not care +about DCD, i.e. it starts to send even if there's someone else on the channel. +This command is required by some implementations of the DAMA channel +access protocol. + + +The Interface of the drivers +============================ + +Unlike previous drivers, these drivers are no longer character devices, +but they are now true kernel network interfaces. Installation is therefore +simple. Once installed, four interfaces named bc{sf,sh,p,e}[0-3] are available. +sethdlc from the ax25 utilities may be used to set driver states etc. +Users of userland AX.25 stacks may use the net2kiss utility (also available +in the ax25 utilities package) to convert packets of a network interface +to a KISS stream on a pseudo tty. There's also a patch available from +me for WAMPES which allows attaching a kernel network interface directly. + + +Configuring the driver +====================== + +Every time a driver is inserted into the kernel, it has to know which +modems it should access at which ports. This can be done with the setbaycom +utility. If you are only using one modem, you can also configure the +driver from the insmod command line (or by means of an option line in +``/etc/modprobe.d/*.conf``). + +Examples:: + + modprobe baycom_ser_fdx mode="ser12*" iobase=0x3f8 irq=4 + sethdlc -i bcsf0 -p mode "ser12*" io 0x3f8 irq 4 + +Both lines configure the first port to drive a ser12 modem at the first +serial port (COM1 under DOS). The * in the mode parameter instructs the driver +to use the software DCD algorithm (see below):: + + insmod baycom_par mode="picpar" iobase=0x378 + sethdlc -i bcp0 -p mode "picpar" io 0x378 + +Both lines configure the first port to drive a picpar modem at the +first parallel port (LPT1 under DOS). (Note: picpar implies +hardware DCD, par96 implies software DCD). + +The channel access parameters can be set with sethdlc -a or kissparms. +Note that both utilities interpret the values slightly differently. + + +Hardware DCD versus Software DCD +================================ + +To avoid collisions on the air, the driver must know when the channel is +busy. This is the task of the DCD circuitry/software. The driver may either +utilise a software DCD algorithm (options=1) or use a DCD signal from +the hardware (options=0). + +======= ================================================================= +ser12 if software DCD is utilised, the radio's squelch should always be + open. It is highly recommended to use the software DCD algorithm, + as it is much faster than most hardware squelch circuitry. The + disadvantage is a slightly higher load on the system. + +par96 the software DCD algorithm for this type of modem is rather poor. + The modem simply does not provide enough information to implement + a reasonable DCD algorithm in software. Therefore, if your radio + feeds the DCD input of the PAR96 modem, the use of the hardware + DCD circuitry is recommended. + +picpar the picpar modem features a builtin DCD hardware, which is highly + recommended. +======= ================================================================= + + + +Compatibility with the rest of the Linux kernel +=============================================== + +The serial driver and the baycom serial drivers compete +for the same hardware resources. Of course only one driver can access a given +interface at a time. The serial driver grabs all interfaces it can find at +startup time. Therefore the baycom drivers subsequently won't be able to +access a serial port. You might therefore find it necessary to release +a port owned by the serial driver with 'setserial /dev/ttyS# uart none', where +# is the number of the interface. The baycom drivers do not reserve any +ports at startup, unless one is specified on the 'insmod' command line. Another +method to solve the problem is to compile all drivers as modules and +leave it to kmod to load the correct driver depending on the application. + +The parallel port drivers (baycom_par, baycom_epp) now use the parport subsystem +to arbitrate the ports between different client drivers. + +vy 73s de + +Tom Sailer, sailer@ife.ee.ethz.ch + +hb9jnx @ hb9w.ampr.org diff --git a/Documentation/networking/baycom.txt b/Documentation/networking/baycom.txt deleted file mode 100644 index 688f18fd4467..000000000000 --- a/Documentation/networking/baycom.txt +++ /dev/null @@ -1,158 +0,0 @@ - LINUX DRIVERS FOR BAYCOM MODEMS - - Thomas M. Sailer, HB9JNX/AE4WA, - -!!NEW!! (04/98) The drivers for the baycom modems have been split into -separate drivers as they did not share any code, and the driver -and device names have changed. - -This document describes the Linux Kernel Drivers for simple Baycom style -amateur radio modems. - -The following drivers are available: - -baycom_ser_fdx: - This driver supports the SER12 modems either full or half duplex. - Its baud rate may be changed via the `baud' module parameter, - therefore it supports just about every bit bang modem on a - serial port. Its devices are called bcsf0 through bcsf3. - This is the recommended driver for SER12 type modems, - however if you have a broken UART clone that does not have working - delta status bits, you may try baycom_ser_hdx. - -baycom_ser_hdx: - This is an alternative driver for SER12 type modems. - It only supports half duplex, and only 1200 baud. Its devices - are called bcsh0 through bcsh3. Use this driver only if baycom_ser_fdx - does not work with your UART. - -baycom_par: - This driver supports the par96 and picpar modems. - Its devices are called bcp0 through bcp3. - -baycom_epp: - This driver supports the EPP modem. - Its devices are called bce0 through bce3. - This driver is work-in-progress. - -The following modems are supported: - -ser12: This is a very simple 1200 baud AFSK modem. The modem consists only - of a modulator/demodulator chip, usually a TI TCM3105. The computer - is responsible for regenerating the receiver bit clock, as well as - for handling the HDLC protocol. The modem connects to a serial port, - hence the name. Since the serial port is not used as an async serial - port, the kernel driver for serial ports cannot be used, and this - driver only supports standard serial hardware (8250, 16450, 16550) - -par96: This is a modem for 9600 baud FSK compatible to the G3RUH standard. - The modem does all the filtering and regenerates the receiver clock. - Data is transferred from and to the PC via a shift register. - The shift register is filled with 16 bits and an interrupt is signalled. - The PC then empties the shift register in a burst. This modem connects - to the parallel port, hence the name. The modem leaves the - implementation of the HDLC protocol and the scrambler polynomial to - the PC. - -picpar: This is a redesign of the par96 modem by Henning Rech, DF9IC. The modem - is protocol compatible to par96, but uses only three low power ICs - and can therefore be fed from the parallel port and does not require - an additional power supply. Furthermore, it incorporates a carrier - detect circuitry. - -EPP: This is a high-speed modem adaptor that connects to an enhanced parallel port. - Its target audience is users working over a high speed hub (76.8kbit/s). - -eppfpga: This is a redesign of the EPP adaptor. - - - -All of the above modems only support half duplex communications. However, -the driver supports the KISS (see below) fullduplex command. It then simply -starts to send as soon as there's a packet to transmit and does not care -about DCD, i.e. it starts to send even if there's someone else on the channel. -This command is required by some implementations of the DAMA channel -access protocol. - - -The Interface of the drivers - -Unlike previous drivers, these drivers are no longer character devices, -but they are now true kernel network interfaces. Installation is therefore -simple. Once installed, four interfaces named bc{sf,sh,p,e}[0-3] are available. -sethdlc from the ax25 utilities may be used to set driver states etc. -Users of userland AX.25 stacks may use the net2kiss utility (also available -in the ax25 utilities package) to convert packets of a network interface -to a KISS stream on a pseudo tty. There's also a patch available from -me for WAMPES which allows attaching a kernel network interface directly. - - -Configuring the driver - -Every time a driver is inserted into the kernel, it has to know which -modems it should access at which ports. This can be done with the setbaycom -utility. If you are only using one modem, you can also configure the -driver from the insmod command line (or by means of an option line in -/etc/modprobe.d/*.conf). - -Examples: - modprobe baycom_ser_fdx mode="ser12*" iobase=0x3f8 irq=4 - sethdlc -i bcsf0 -p mode "ser12*" io 0x3f8 irq 4 - -Both lines configure the first port to drive a ser12 modem at the first -serial port (COM1 under DOS). The * in the mode parameter instructs the driver to use -the software DCD algorithm (see below). - - insmod baycom_par mode="picpar" iobase=0x378 - sethdlc -i bcp0 -p mode "picpar" io 0x378 - -Both lines configure the first port to drive a picpar modem at the -first parallel port (LPT1 under DOS). (Note: picpar implies -hardware DCD, par96 implies software DCD). - -The channel access parameters can be set with sethdlc -a or kissparms. -Note that both utilities interpret the values slightly differently. - - -Hardware DCD versus Software DCD - -To avoid collisions on the air, the driver must know when the channel is -busy. This is the task of the DCD circuitry/software. The driver may either -utilise a software DCD algorithm (options=1) or use a DCD signal from -the hardware (options=0). - -ser12: if software DCD is utilised, the radio's squelch should always be - open. It is highly recommended to use the software DCD algorithm, - as it is much faster than most hardware squelch circuitry. The - disadvantage is a slightly higher load on the system. - -par96: the software DCD algorithm for this type of modem is rather poor. - The modem simply does not provide enough information to implement - a reasonable DCD algorithm in software. Therefore, if your radio - feeds the DCD input of the PAR96 modem, the use of the hardware - DCD circuitry is recommended. - -picpar: the picpar modem features a builtin DCD hardware, which is highly - recommended. - - - -Compatibility with the rest of the Linux kernel - -The serial driver and the baycom serial drivers compete -for the same hardware resources. Of course only one driver can access a given -interface at a time. The serial driver grabs all interfaces it can find at -startup time. Therefore the baycom drivers subsequently won't be able to -access a serial port. You might therefore find it necessary to release -a port owned by the serial driver with 'setserial /dev/ttyS# uart none', where -# is the number of the interface. The baycom drivers do not reserve any -ports at startup, unless one is specified on the 'insmod' command line. Another -method to solve the problem is to compile all drivers as modules and -leave it to kmod to load the correct driver depending on the application. - -The parallel port drivers (baycom_par, baycom_epp) now use the parport subsystem -to arbitrate the ports between different client drivers. - -vy 73s de -Tom Sailer, sailer@ife.ee.ethz.ch -hb9jnx @ hb9w.ampr.org diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 6a5858b27cf6..fbf845fbaff7 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -43,6 +43,7 @@ Contents: arcnet atm ax25 + baycom .. only:: subproject and html diff --git a/drivers/net/hamradio/Kconfig b/drivers/net/hamradio/Kconfig index bf306fed04cc..fe409819b56d 100644 --- a/drivers/net/hamradio/Kconfig +++ b/drivers/net/hamradio/Kconfig @@ -127,7 +127,7 @@ config BAYCOM_SER_FDX your serial interface chip. To configure the driver, use the sethdlc utility available in the standard ax25 utilities package. For information on the modems, see and - . + . To compile this driver as a module, choose M here: the module will be called baycom_ser_fdx. This is recommended. @@ -145,7 +145,7 @@ config BAYCOM_SER_HDX the driver, use the sethdlc utility available in the standard ax25 utilities package. For information on the modems, see and - . + . To compile this driver as a module, choose M here: the module will be called baycom_ser_hdx. This is recommended. @@ -160,7 +160,7 @@ config BAYCOM_PAR par96 designs. To configure the driver, use the sethdlc utility available in the standard ax25 utilities package. For information on the modems, see and the file - . + . To compile this driver as a module, choose M here: the module will be called baycom_par. This is recommended. @@ -175,7 +175,7 @@ config BAYCOM_EPP designs. To configure the driver, use the sethdlc utility available in the standard ax25 utilities package. For information on the modems, see and the file - . + . To compile this driver as a module, choose M here: the module will be called baycom_epp. This is recommended. -- cgit v1.2.3 From a362032eca22d03071c4613f6ca503be982bf375 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:24 +0200 Subject: docs: networking: convert bonding.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - mark tables as such; - add notes markups; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/bonding.rst | 2890 ++++++++++++++++++++ Documentation/networking/bonding.txt | 2837 ------------------- .../networking/device_drivers/intel/e100.rst | 2 +- .../networking/device_drivers/intel/ixgb.rst | 2 +- Documentation/networking/index.rst | 1 + drivers/net/Kconfig | 2 +- 6 files changed, 2894 insertions(+), 2840 deletions(-) create mode 100644 Documentation/networking/bonding.rst delete mode 100644 Documentation/networking/bonding.txt (limited to 'Documentation') diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst new file mode 100644 index 000000000000..dd49f95d28d3 --- /dev/null +++ b/Documentation/networking/bonding.rst @@ -0,0 +1,2890 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Linux Ethernet Bonding Driver HOWTO +=================================== + +Latest update: 27 April 2011 + +Initial release: Thomas Davis + +Corrections, HA extensions: 2000/10/03-15: + + - Willy Tarreau + - Constantine Gavrilov + - Chad N. Tindel + - Janice Girouard + - Jay Vosburgh + +Reorganized and updated Feb 2005 by Jay Vosburgh +Added Sysfs information: 2006/04/24 + + - Mitch Williams + +Introduction +============ + +The Linux bonding driver provides a method for aggregating +multiple network interfaces into a single logical "bonded" interface. +The behavior of the bonded interfaces depends upon the mode; generally +speaking, modes provide either hot standby or load balancing services. +Additionally, link integrity monitoring may be performed. + +The bonding driver originally came from Donald Becker's +beowulf patches for kernel 2.0. It has changed quite a bit since, and +the original tools from extreme-linux and beowulf sites will not work +with this version of the driver. + +For new versions of the driver, updated userspace tools, and +who to ask for help, please follow the links at the end of this file. + +.. Table of Contents + + 1. Bonding Driver Installation + + 2. Bonding Driver Options + + 3. Configuring Bonding Devices + 3.1 Configuration with Sysconfig Support + 3.1.1 Using DHCP with Sysconfig + 3.1.2 Configuring Multiple Bonds with Sysconfig + 3.2 Configuration with Initscripts Support + 3.2.1 Using DHCP with Initscripts + 3.2.2 Configuring Multiple Bonds with Initscripts + 3.3 Configuring Bonding Manually with Ifenslave + 3.3.1 Configuring Multiple Bonds Manually + 3.4 Configuring Bonding Manually via Sysfs + 3.5 Configuration with Interfaces Support + 3.6 Overriding Configuration for Special Cases + 3.7 Configuring LACP for 802.3ad mode in a more secure way + + 4. Querying Bonding Configuration + 4.1 Bonding Configuration + 4.2 Network Configuration + + 5. Switch Configuration + + 6. 802.1q VLAN Support + + 7. Link Monitoring + 7.1 ARP Monitor Operation + 7.2 Configuring Multiple ARP Targets + 7.3 MII Monitor Operation + + 8. Potential Trouble Sources + 8.1 Adventures in Routing + 8.2 Ethernet Device Renaming + 8.3 Painfully Slow Or No Failed Link Detection By Miimon + + 9. SNMP agents + + 10. Promiscuous mode + + 11. Configuring Bonding for High Availability + 11.1 High Availability in a Single Switch Topology + 11.2 High Availability in a Multiple Switch Topology + 11.2.1 HA Bonding Mode Selection for Multiple Switch Topology + 11.2.2 HA Link Monitoring for Multiple Switch Topology + + 12. Configuring Bonding for Maximum Throughput + 12.1 Maximum Throughput in a Single Switch Topology + 12.1.1 MT Bonding Mode Selection for Single Switch Topology + 12.1.2 MT Link Monitoring for Single Switch Topology + 12.2 Maximum Throughput in a Multiple Switch Topology + 12.2.1 MT Bonding Mode Selection for Multiple Switch Topology + 12.2.2 MT Link Monitoring for Multiple Switch Topology + + 13. Switch Behavior Issues + 13.1 Link Establishment and Failover Delays + 13.2 Duplicated Incoming Packets + + 14. Hardware Specific Considerations + 14.1 IBM BladeCenter + + 15. Frequently Asked Questions + + 16. Resources and Links + + +1. Bonding Driver Installation +============================== + +Most popular distro kernels ship with the bonding driver +already available as a module. If your distro does not, or you +have need to compile bonding from source (e.g., configuring and +installing a mainline kernel from kernel.org), you'll need to perform +the following steps: + +1.1 Configure and build the kernel with bonding +----------------------------------------------- + +The current version of the bonding driver is available in the +drivers/net/bonding subdirectory of the most recent kernel source +(which is available on http://kernel.org). Most users "rolling their +own" will want to use the most recent kernel from kernel.org. + +Configure kernel with "make menuconfig" (or "make xconfig" or +"make config"), then select "Bonding driver support" in the "Network +device support" section. It is recommended that you configure the +driver as module since it is currently the only way to pass parameters +to the driver or configure more than one bonding device. + +Build and install the new kernel and modules. + +1.2 Bonding Control Utility +--------------------------- + +It is recommended to configure bonding via iproute2 (netlink) +or sysfs, the old ifenslave control utility is obsolete. + +2. Bonding Driver Options +========================= + +Options for the bonding driver are supplied as parameters to the +bonding module at load time, or are specified via sysfs. + +Module options may be given as command line arguments to the +insmod or modprobe command, but are usually specified in either the +``/etc/modprobe.d/*.conf`` configuration files, or in a distro-specific +configuration file (some of which are detailed in the next section). + +Details on bonding support for sysfs is provided in the +"Configuring Bonding Manually via Sysfs" section, below. + +The available bonding driver parameters are listed below. If a +parameter is not specified the default value is used. When initially +configuring a bond, it is recommended "tail -f /var/log/messages" be +run in a separate window to watch for bonding driver error messages. + +It is critical that either the miimon or arp_interval and +arp_ip_target parameters be specified, otherwise serious network +degradation will occur during link failures. Very few devices do not +support at least miimon, so there is really no reason not to use it. + +Options with textual values will accept either the text name +or, for backwards compatibility, the option value. E.g., +"mode=802.3ad" and "mode=4" set the same mode. + +The parameters are as follows: + +active_slave + + Specifies the new active slave for modes that support it + (active-backup, balance-alb and balance-tlb). Possible values + are the name of any currently enslaved interface, or an empty + string. If a name is given, the slave and its link must be up in order + to be selected as the new active slave. If an empty string is + specified, the current active slave is cleared, and a new active + slave is selected automatically. + + Note that this is only available through the sysfs interface. No module + parameter by this name exists. + + The normal value of this option is the name of the currently + active slave, or the empty string if there is no active slave or + the current mode does not use an active slave. + +ad_actor_sys_prio + + In an AD system, this specifies the system priority. The allowed range + is 1 - 65535. If the value is not specified, it takes 65535 as the + default value. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +ad_actor_system + + In an AD system, this specifies the mac-address for the actor in + protocol packet exchanges (LACPDUs). The value cannot be NULL or + multicast. It is preferred to have the local-admin bit set for this + mac but driver does not enforce it. If the value is not given then + system defaults to using the masters' mac address as actors' system + address. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +ad_select + + Specifies the 802.3ad aggregation selection logic to use. The + possible values and their effects are: + + stable or 0 + + The active aggregator is chosen by largest aggregate + bandwidth. + + Reselection of the active aggregator occurs only when all + slaves of the active aggregator are down or the active + aggregator has no slaves. + + This is the default value. + + bandwidth or 1 + + The active aggregator is chosen by largest aggregate + bandwidth. Reselection occurs if: + + - A slave is added to or removed from the bond + + - Any slave's link state changes + + - Any slave's 802.3ad association state changes + + - The bond's administrative state changes to up + + count or 2 + + The active aggregator is chosen by the largest number of + ports (slaves). Reselection occurs as described under the + "bandwidth" setting, above. + + The bandwidth and count selection policies permit failover of + 802.3ad aggregations when partial failure of the active aggregator + occurs. This keeps the aggregator with the highest availability + (either in bandwidth or in number of ports) active at all times. + + This option was added in bonding version 3.4.0. + +ad_user_port_key + + In an AD system, the port-key has three parts as shown below - + + ===== ============ + Bits Use + ===== ============ + 00 Duplex + 01-05 Speed + 06-15 User-defined + ===== ============ + + This defines the upper 10 bits of the port key. The values can be + from 0 - 1023. If not given, the system defaults to 0. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +all_slaves_active + + Specifies that duplicate frames (received on inactive ports) should be + dropped (0) or delivered (1). + + Normally, bonding will drop duplicate frames (received on inactive + ports), which is desirable for most users. But there are some times + it is nice to allow duplicate frames to be delivered. + + The default value is 0 (drop duplicate frames received on inactive + ports). + +arp_interval + + Specifies the ARP link monitoring frequency in milliseconds. + + The ARP monitor works by periodically checking the slave + devices to determine whether they have sent or received + traffic recently (the precise criteria depends upon the + bonding mode, and the state of the slave). Regular traffic is + generated via ARP probes issued for the addresses specified by + the arp_ip_target option. + + This behavior can be modified by the arp_validate option, + below. + + If ARP monitoring is used in an etherchannel compatible mode + (modes 0 and 2), the switch should be configured in a mode + that evenly distributes packets across all links. If the + switch is configured to distribute the packets in an XOR + fashion, all replies from the ARP targets will be received on + the same link which could cause the other team members to + fail. ARP monitoring should not be used in conjunction with + miimon. A value of 0 disables ARP monitoring. The default + value is 0. + +arp_ip_target + + Specifies the IP addresses to use as ARP monitoring peers when + arp_interval is > 0. These are the targets of the ARP request + sent to determine the health of the link to the targets. + Specify these values in ddd.ddd.ddd.ddd format. Multiple IP + addresses must be separated by a comma. At least one IP + address must be given for ARP monitoring to function. The + maximum number of targets that can be specified is 16. The + default value is no IP addresses. + +arp_validate + + Specifies whether or not ARP probes and replies should be + validated in any mode that supports arp monitoring, or whether + non-ARP traffic should be filtered (disregarded) for link + monitoring purposes. + + Possible values are: + + none or 0 + + No validation or filtering is performed. + + active or 1 + + Validation is performed only for the active slave. + + backup or 2 + + Validation is performed only for backup slaves. + + all or 3 + + Validation is performed for all slaves. + + filter or 4 + + Filtering is applied to all slaves. No validation is + performed. + + filter_active or 5 + + Filtering is applied to all slaves, validation is performed + only for the active slave. + + filter_backup or 6 + + Filtering is applied to all slaves, validation is performed + only for backup slaves. + + Validation: + + Enabling validation causes the ARP monitor to examine the incoming + ARP requests and replies, and only consider a slave to be up if it + is receiving the appropriate ARP traffic. + + For an active slave, the validation checks ARP replies to confirm + that they were generated by an arp_ip_target. Since backup slaves + do not typically receive these replies, the validation performed + for backup slaves is on the broadcast ARP request sent out via the + active slave. It is possible that some switch or network + configurations may result in situations wherein the backup slaves + do not receive the ARP requests; in such a situation, validation + of backup slaves must be disabled. + + The validation of ARP requests on backup slaves is mainly helping + bonding to decide which slaves are more likely to work in case of + the active slave failure, it doesn't really guarantee that the + backup slave will work if it's selected as the next active slave. + + Validation is useful in network configurations in which multiple + bonding hosts are concurrently issuing ARPs to one or more targets + beyond a common switch. Should the link between the switch and + target fail (but not the switch itself), the probe traffic + generated by the multiple bonding instances will fool the standard + ARP monitor into considering the links as still up. Use of + validation can resolve this, as the ARP monitor will only consider + ARP requests and replies associated with its own instance of + bonding. + + Filtering: + + Enabling filtering causes the ARP monitor to only use incoming ARP + packets for link availability purposes. Arriving packets that are + not ARPs are delivered normally, but do not count when determining + if a slave is available. + + Filtering operates by only considering the reception of ARP + packets (any ARP packet, regardless of source or destination) when + determining if a slave has received traffic for link availability + purposes. + + Filtering is useful in network configurations in which significant + levels of third party broadcast traffic would fool the standard + ARP monitor into considering the links as still up. Use of + filtering can resolve this, as only ARP traffic is considered for + link availability purposes. + + This option was added in bonding version 3.1.0. + +arp_all_targets + + Specifies the quantity of arp_ip_targets that must be reachable + in order for the ARP monitor to consider a slave as being up. + This option affects only active-backup mode for slaves with + arp_validation enabled. + + Possible values are: + + any or 0 + + consider the slave up only when any of the arp_ip_targets + is reachable + + all or 1 + + consider the slave up only when all of the arp_ip_targets + are reachable + +downdelay + + Specifies the time, in milliseconds, to wait before disabling + a slave after a link failure has been detected. This option + is only valid for the miimon link monitor. The downdelay + value should be a multiple of the miimon value; if not, it + will be rounded down to the nearest multiple. The default + value is 0. + +fail_over_mac + + Specifies whether active-backup mode should set all slaves to + the same MAC address at enslavement (the traditional + behavior), or, when enabled, perform special handling of the + bond's MAC address in accordance with the selected policy. + + Possible values are: + + none or 0 + + This setting disables fail_over_mac, and causes + bonding to set all slaves of an active-backup bond to + the same MAC address at enslavement time. This is the + default. + + active or 1 + + The "active" fail_over_mac policy indicates that the + MAC address of the bond should always be the MAC + address of the currently active slave. The MAC + address of the slaves is not changed; instead, the MAC + address of the bond changes during a failover. + + This policy is useful for devices that cannot ever + alter their MAC address, or for devices that refuse + incoming broadcasts with their own source MAC (which + interferes with the ARP monitor). + + The down side of this policy is that every device on + the network must be updated via gratuitous ARP, + vs. just updating a switch or set of switches (which + often takes place for any traffic, not just ARP + traffic, if the switch snoops incoming traffic to + update its tables) for the traditional method. If the + gratuitous ARP is lost, communication may be + disrupted. + + When this policy is used in conjunction with the mii + monitor, devices which assert link up prior to being + able to actually transmit and receive are particularly + susceptible to loss of the gratuitous ARP, and an + appropriate updelay setting may be required. + + follow or 2 + + The "follow" fail_over_mac policy causes the MAC + address of the bond to be selected normally (normally + the MAC address of the first slave added to the bond). + However, the second and subsequent slaves are not set + to this MAC address while they are in a backup role; a + slave is programmed with the bond's MAC address at + failover time (and the formerly active slave receives + the newly active slave's MAC address). + + This policy is useful for multiport devices that + either become confused or incur a performance penalty + when multiple ports are programmed with the same MAC + address. + + + The default policy is none, unless the first slave cannot + change its MAC address, in which case the active policy is + selected by default. + + This option may be modified via sysfs only when no slaves are + present in the bond. + + This option was added in bonding version 3.2.0. The "follow" + policy was added in bonding version 3.3.0. + +lacp_rate + + Option specifying the rate in which we'll ask our link partner + to transmit LACPDU packets in 802.3ad mode. Possible values + are: + + slow or 0 + Request partner to transmit LACPDUs every 30 seconds + + fast or 1 + Request partner to transmit LACPDUs every 1 second + + The default is slow. + +max_bonds + + Specifies the number of bonding devices to create for this + instance of the bonding driver. E.g., if max_bonds is 3, and + the bonding driver is not already loaded, then bond0, bond1 + and bond2 will be created. The default value is 1. Specifying + a value of 0 will load bonding, but will not create any devices. + +miimon + + Specifies the MII link monitoring frequency in milliseconds. + This determines how often the link state of each slave is + inspected for link failures. A value of zero disables MII + link monitoring. A value of 100 is a good starting point. + The use_carrier option, below, affects how the link state is + determined. See the High Availability section for additional + information. The default value is 0. + +min_links + + Specifies the minimum number of links that must be active before + asserting carrier. It is similar to the Cisco EtherChannel min-links + feature. This allows setting the minimum number of member ports that + must be up (link-up state) before marking the bond device as up + (carrier on). This is useful for situations where higher level services + such as clustering want to ensure a minimum number of low bandwidth + links are active before switchover. This option only affect 802.3ad + mode. + + The default value is 0. This will cause carrier to be asserted (for + 802.3ad mode) whenever there is an active aggregator, regardless of the + number of available links in that aggregator. Note that, because an + aggregator cannot be active without at least one available link, + setting this option to 0 or to 1 has the exact same effect. + +mode + + Specifies one of the bonding policies. The default is + balance-rr (round robin). Possible values are: + + balance-rr or 0 + + Round-robin policy: Transmit packets in sequential + order from the first available slave through the + last. This mode provides load balancing and fault + tolerance. + + active-backup or 1 + + Active-backup policy: Only one slave in the bond is + active. A different slave becomes active if, and only + if, the active slave fails. The bond's MAC address is + externally visible on only one port (network adapter) + to avoid confusing the switch. + + In bonding version 2.6.2 or later, when a failover + occurs in active-backup mode, bonding will issue one + or more gratuitous ARPs on the newly active slave. + One gratuitous ARP is issued for the bonding master + interface and each VLAN interfaces configured above + it, provided that the interface has at least one IP + address configured. Gratuitous ARPs issued for VLAN + interfaces are tagged with the appropriate VLAN id. + + This mode provides fault tolerance. The primary + option, documented below, affects the behavior of this + mode. + + balance-xor or 2 + + XOR policy: Transmit based on the selected transmit + hash policy. The default policy is a simple [(source + MAC address XOR'd with destination MAC address XOR + packet type ID) modulo slave count]. Alternate transmit + policies may be selected via the xmit_hash_policy option, + described below. + + This mode provides load balancing and fault tolerance. + + broadcast or 3 + + Broadcast policy: transmits everything on all slave + interfaces. This mode provides fault tolerance. + + 802.3ad or 4 + + IEEE 802.3ad Dynamic link aggregation. Creates + aggregation groups that share the same speed and + duplex settings. Utilizes all slaves in the active + aggregator according to the 802.3ad specification. + + Slave selection for outgoing traffic is done according + to the transmit hash policy, which may be changed from + the default simple XOR policy via the xmit_hash_policy + option, documented below. Note that not all transmit + policies may be 802.3ad compliant, particularly in + regards to the packet mis-ordering requirements of + section 43.2.4 of the 802.3ad standard. Differing + peer implementations will have varying tolerances for + noncompliance. + + Prerequisites: + + 1. Ethtool support in the base drivers for retrieving + the speed and duplex of each slave. + + 2. A switch that supports IEEE 802.3ad Dynamic link + aggregation. + + Most switches will require some type of configuration + to enable 802.3ad mode. + + balance-tlb or 5 + + Adaptive transmit load balancing: channel bonding that + does not require any special switch support. + + In tlb_dynamic_lb=1 mode; the outgoing traffic is + distributed according to the current load (computed + relative to the speed) on each slave. + + In tlb_dynamic_lb=0 mode; the load balancing based on + current load is disabled and the load is distributed + only using the hash distribution. + + Incoming traffic is received by the current slave. + If the receiving slave fails, another slave takes over + the MAC address of the failed receiving slave. + + Prerequisite: + + Ethtool support in the base drivers for retrieving the + speed of each slave. + + balance-alb or 6 + + Adaptive load balancing: includes balance-tlb plus + receive load balancing (rlb) for IPV4 traffic, and + does not require any special switch support. The + receive load balancing is achieved by ARP negotiation. + The bonding driver intercepts the ARP Replies sent by + the local system on their way out and overwrites the + source hardware address with the unique hardware + address of one of the slaves in the bond such that + different peers use different hardware addresses for + the server. + + Receive traffic from connections created by the server + is also balanced. When the local system sends an ARP + Request the bonding driver copies and saves the peer's + IP information from the ARP packet. When the ARP + Reply arrives from the peer, its hardware address is + retrieved and the bonding driver initiates an ARP + reply to this peer assigning it to one of the slaves + in the bond. A problematic outcome of using ARP + negotiation for balancing is that each time that an + ARP request is broadcast it uses the hardware address + of the bond. Hence, peers learn the hardware address + of the bond and the balancing of receive traffic + collapses to the current slave. This is handled by + sending updates (ARP Replies) to all the peers with + their individually assigned hardware address such that + the traffic is redistributed. Receive traffic is also + redistributed when a new slave is added to the bond + and when an inactive slave is re-activated. The + receive load is distributed sequentially (round robin) + among the group of highest speed slaves in the bond. + + When a link is reconnected or a new slave joins the + bond the receive traffic is redistributed among all + active slaves in the bond by initiating ARP Replies + with the selected MAC address to each of the + clients. The updelay parameter (detailed below) must + be set to a value equal or greater than the switch's + forwarding delay so that the ARP Replies sent to the + peers will not be blocked by the switch. + + Prerequisites: + + 1. Ethtool support in the base drivers for retrieving + the speed of each slave. + + 2. Base driver support for setting the hardware + address of a device while it is open. This is + required so that there will always be one slave in the + team using the bond hardware address (the + curr_active_slave) while having a unique hardware + address for each slave in the bond. If the + curr_active_slave fails its hardware address is + swapped with the new curr_active_slave that was + chosen. + +num_grat_arp, +num_unsol_na + + Specify the number of peer notifications (gratuitous ARPs and + unsolicited IPv6 Neighbor Advertisements) to be issued after a + failover event. As soon as the link is up on the new slave + (possibly immediately) a peer notification is sent on the + bonding device and each VLAN sub-device. This is repeated at + the rate specified by peer_notif_delay if the number is + greater than 1. + + The valid range is 0 - 255; the default value is 1. These options + affect only the active-backup mode. These options were added for + bonding versions 3.3.0 and 3.4.0 respectively. + + From Linux 3.0 and bonding version 3.7.1, these notifications + are generated by the ipv4 and ipv6 code and the numbers of + repetitions cannot be set independently. + +packets_per_slave + + Specify the number of packets to transmit through a slave before + moving to the next one. When set to 0 then a slave is chosen at + random. + + The valid range is 0 - 65535; the default value is 1. This option + has effect only in balance-rr mode. + +peer_notif_delay + + Specify the delay, in milliseconds, between each peer + notification (gratuitous ARP and unsolicited IPv6 Neighbor + Advertisement) when they are issued after a failover event. + This delay should be a multiple of the link monitor interval + (arp_interval or miimon, whichever is active). The default + value is 0 which means to match the value of the link monitor + interval. + +primary + + A string (eth0, eth2, etc) specifying which slave is the + primary device. The specified device will always be the + active slave while it is available. Only when the primary is + off-line will alternate devices be used. This is useful when + one slave is preferred over another, e.g., when one slave has + higher throughput than another. + + The primary option is only valid for active-backup(1), + balance-tlb (5) and balance-alb (6) mode. + +primary_reselect + + Specifies the reselection policy for the primary slave. This + affects how the primary slave is chosen to become the active slave + when failure of the active slave or recovery of the primary slave + occurs. This option is designed to prevent flip-flopping between + the primary slave and other slaves. Possible values are: + + always or 0 (default) + + The primary slave becomes the active slave whenever it + comes back up. + + better or 1 + + The primary slave becomes the active slave when it comes + back up, if the speed and duplex of the primary slave is + better than the speed and duplex of the current active + slave. + + failure or 2 + + The primary slave becomes the active slave only if the + current active slave fails and the primary slave is up. + + The primary_reselect setting is ignored in two cases: + + If no slaves are active, the first slave to recover is + made the active slave. + + When initially enslaved, the primary slave is always made + the active slave. + + Changing the primary_reselect policy via sysfs will cause an + immediate selection of the best active slave according to the new + policy. This may or may not result in a change of the active + slave, depending upon the circumstances. + + This option was added for bonding version 3.6.0. + +tlb_dynamic_lb + + Specifies if dynamic shuffling of flows is enabled in tlb + mode. The value has no effect on any other modes. + + The default behavior of tlb mode is to shuffle active flows across + slaves based on the load in that interval. This gives nice lb + characteristics but can cause packet reordering. If re-ordering is + a concern use this variable to disable flow shuffling and rely on + load balancing provided solely by the hash distribution. + xmit-hash-policy can be used to select the appropriate hashing for + the setup. + + The sysfs entry can be used to change the setting per bond device + and the initial value is derived from the module parameter. The + sysfs entry is allowed to be changed only if the bond device is + down. + + The default value is "1" that enables flow shuffling while value "0" + disables it. This option was added in bonding driver 3.7.1 + + +updelay + + Specifies the time, in milliseconds, to wait before enabling a + slave after a link recovery has been detected. This option is + only valid for the miimon link monitor. The updelay value + should be a multiple of the miimon value; if not, it will be + rounded down to the nearest multiple. The default value is 0. + +use_carrier + + Specifies whether or not miimon should use MII or ETHTOOL + ioctls vs. netif_carrier_ok() to determine the link + status. The MII or ETHTOOL ioctls are less efficient and + utilize a deprecated calling sequence within the kernel. The + netif_carrier_ok() relies on the device driver to maintain its + state with netif_carrier_on/off; at this writing, most, but + not all, device drivers support this facility. + + If bonding insists that the link is up when it should not be, + it may be that your network device driver does not support + netif_carrier_on/off. The default state for netif_carrier is + "carrier on," so if a driver does not support netif_carrier, + it will appear as if the link is always up. In this case, + setting use_carrier to 0 will cause bonding to revert to the + MII / ETHTOOL ioctl method to determine the link state. + + A value of 1 enables the use of netif_carrier_ok(), a value of + 0 will use the deprecated MII / ETHTOOL ioctls. The default + value is 1. + +xmit_hash_policy + + Selects the transmit hash policy to use for slave selection in + balance-xor, 802.3ad, and tlb modes. Possible values are: + + layer2 + + Uses XOR of hardware MAC addresses and packet type ID + field to generate the hash. The formula is + + hash = source MAC XOR destination MAC XOR packet type ID + slave number = hash modulo slave count + + This algorithm will place all traffic to a particular + network peer on the same slave. + + This algorithm is 802.3ad compliant. + + layer2+3 + + This policy uses a combination of layer2 and layer3 + protocol information to generate the hash. + + Uses XOR of hardware MAC addresses and IP addresses to + generate the hash. The formula is + + hash = source MAC XOR destination MAC XOR packet type ID + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. + + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. + + This algorithm will place all traffic to a particular + network peer on the same slave. For non-IP traffic, + the formula is the same as for the layer2 transmit + hash policy. + + This policy is intended to provide a more balanced + distribution of traffic than layer2 alone, especially + in environments where a layer3 gateway device is + required to reach most destinations. + + This algorithm is 802.3ad compliant. + + layer3+4 + + This policy uses upper layer protocol information, + when available, to generate the hash. This allows for + traffic to a particular network peer to span multiple + slaves, although a single connection will not span + multiple slaves. + + The formula for unfragmented TCP and UDP packets is + + hash = source port, destination port (as in the header) + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. + + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. + + For fragmented TCP or UDP packets and all other IPv4 and + IPv6 protocol traffic, the source and destination port + information is omitted. For non-IP traffic, the + formula is the same as for the layer2 transmit hash + policy. + + This algorithm is not fully 802.3ad compliant. A + single TCP or UDP conversation containing both + fragmented and unfragmented packets will see packets + striped across two interfaces. This may result in out + of order delivery. Most traffic types will not meet + this criteria, as TCP rarely fragments traffic, and + most UDP traffic is not involved in extended + conversations. Other implementations of 802.3ad may + or may not tolerate this noncompliance. + + encap2+3 + + This policy uses the same formula as layer2+3 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + + encap3+4 + + This policy uses the same formula as layer3+4 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + + The default value is layer2. This option was added in bonding + version 2.6.3. In earlier versions of bonding, this parameter + does not exist, and the layer2 policy is the only policy. The + layer2+3 value was added for bonding version 3.2.2. + +resend_igmp + + Specifies the number of IGMP membership reports to be issued after + a failover event. One membership report is issued immediately after + the failover, subsequent packets are sent in each 200ms interval. + + The valid range is 0 - 255; the default value is 1. A value of 0 + prevents the IGMP membership report from being issued in response + to the failover event. + + This option is useful for bonding modes balance-rr (0), active-backup + (1), balance-tlb (5) and balance-alb (6), in which a failover can + switch the IGMP traffic from one slave to another. Therefore a fresh + IGMP report must be issued to cause the switch to forward the incoming + IGMP traffic over the newly selected slave. + + This option was added for bonding version 3.7.0. + +lp_interval + + Specifies the number of seconds between instances where the bonding + driver sends learning packets to each slaves peer switch. + + The valid range is 1 - 0x7fffffff; the default value is 1. This Option + has effect only in balance-tlb and balance-alb modes. + +3. Configuring Bonding Devices +============================== + +You can configure bonding using either your distro's network +initialization scripts, or manually using either iproute2 or the +sysfs interface. Distros generally use one of three packages for the +network initialization scripts: initscripts, sysconfig or interfaces. +Recent versions of these packages have support for bonding, while older +versions do not. + +We will first describe the options for configuring bonding for +distros using versions of initscripts, sysconfig and interfaces with full +or partial support for bonding, then provide information on enabling +bonding without support from the network initialization scripts (i.e., +older versions of initscripts or sysconfig). + +If you're unsure whether your distro uses sysconfig, +initscripts or interfaces, or don't know if it's new enough, have no fear. +Determining this is fairly straightforward. + +First, look for a file called interfaces in /etc/network directory. +If this file is present in your system, then your system use interfaces. See +Configuration with Interfaces Support. + +Else, issue the command:: + + $ rpm -qf /sbin/ifup + +It will respond with a line of text starting with either +"initscripts" or "sysconfig," followed by some numbers. This is the +package that provides your network initialization scripts. + +Next, to determine if your installation supports bonding, +issue the command:: + + $ grep ifenslave /sbin/ifup + +If this returns any matches, then your initscripts or +sysconfig has support for bonding. + +3.1 Configuration with Sysconfig Support +---------------------------------------- + +This section applies to distros using a version of sysconfig +with bonding support, for example, SuSE Linux Enterprise Server 9. + +SuSE SLES 9's networking configuration system does support +bonding, however, at this writing, the YaST system configuration +front end does not provide any means to work with bonding devices. +Bonding devices can be managed by hand, however, as follows. + +First, if they have not already been configured, configure the +slave devices. On SLES 9, this is most easily done by running the +yast2 sysconfig configuration utility. The goal is for to create an +ifcfg-id file for each slave device. The simplest way to accomplish +this is to configure the devices for DHCP (this is only to get the +file ifcfg-id file created; see below for some issues with DHCP). The +name of the configuration file for each device will be of the form:: + + ifcfg-id-xx:xx:xx:xx:xx:xx + +Where the "xx" portion will be replaced with the digits from +the device's permanent MAC address. + +Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been +created, it is necessary to edit the configuration files for the slave +devices (the MAC addresses correspond to those of the slave devices). +Before editing, the file will contain multiple lines, and will look +something like this:: + + BOOTPROTO='dhcp' + STARTMODE='on' + USERCTL='no' + UNIQUE='XNzu.WeZGOGF+4wE' + _nm_name='bus-pci-0001:61:01.0' + +Change the BOOTPROTO and STARTMODE lines to the following:: + + BOOTPROTO='none' + STARTMODE='off' + +Do not alter the UNIQUE or _nm_name lines. Remove any other +lines (USERCTL, etc). + +Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified, +it's time to create the configuration file for the bonding device +itself. This file is named ifcfg-bondX, where X is the number of the +bonding device to create, starting at 0. The first such file is +ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig +network configuration system will correctly start multiple instances +of bonding. + +The contents of the ifcfg-bondX file is as follows:: + + BOOTPROTO="static" + BROADCAST="10.0.2.255" + IPADDR="10.0.2.10" + NETMASK="255.255.0.0" + NETWORK="10.0.2.0" + REMOTE_IPADDR="" + STARTMODE="onboot" + BONDING_MASTER="yes" + BONDING_MODULE_OPTS="mode=active-backup miimon=100" + BONDING_SLAVE0="eth0" + BONDING_SLAVE1="bus-pci-0000:06:08.1" + +Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK +values with the appropriate values for your network. + +The STARTMODE specifies when the device is brought online. +The possible values are: + + ======== ====================================================== + onboot The device is started at boot time. If you're not + sure, this is probably what you want. + + manual The device is started only when ifup is called + manually. Bonding devices may be configured this + way if you do not wish them to start automatically + at boot for some reason. + + hotplug The device is started by a hotplug event. This is not + a valid choice for a bonding device. + + off or The device configuration is ignored. + ignore + ======== ====================================================== + +The line BONDING_MASTER='yes' indicates that the device is a +bonding master device. The only useful value is "yes." + +The contents of BONDING_MODULE_OPTS are supplied to the +instance of the bonding module for this device. Specify the options +for the bonding mode, link monitoring, and so on here. Do not include +the max_bonds bonding parameter; this will confuse the configuration +system if you have multiple bonding devices. + +Finally, supply one BONDING_SLAVEn="slave device" for each +slave. where "n" is an increasing value, one for each slave. The +"slave device" is either an interface name, e.g., "eth0", or a device +specifier for the network device. The interface name is easier to +find, but the ethN names are subject to change at boot time if, e.g., +a device early in the sequence has failed. The device specifiers +(bus-pci-0000:06:08.1 in the example above) specify the physical +network device, and will not change unless the device's bus location +changes (for example, it is moved from one PCI slot to another). The +example above uses one of each type for demonstration purposes; most +configurations will choose one or the other for all slave devices. + +When all configuration files have been modified or created, +networking must be restarted for the configuration changes to take +effect. This can be accomplished via the following:: + + # /etc/init.d/network restart + +Note that the network control script (/sbin/ifdown) will +remove the bonding module as part of the network shutdown processing, +so it is not necessary to remove the module by hand if, e.g., the +module parameters have changed. + +Also, at this writing, YaST/YaST2 will not manage bonding +devices (they do not show bonding interfaces on its list of network +devices). It is necessary to edit the configuration file by hand to +change the bonding configuration. + +Additional general options and details of the ifcfg file +format can be found in an example ifcfg template file:: + + /etc/sysconfig/network/ifcfg.template + +Note that the template does not document the various ``BONDING_*`` +settings described above, but does describe many of the other options. + +3.1.1 Using DHCP with Sysconfig +------------------------------- + +Under sysconfig, configuring a device with BOOTPROTO='dhcp' +will cause it to query DHCP for its IP address information. At this +writing, this does not function for bonding devices; the scripts +attempt to obtain the device address from DHCP prior to adding any of +the slave devices. Without active slaves, the DHCP requests are not +sent to the network. + +3.1.2 Configuring Multiple Bonds with Sysconfig +----------------------------------------------- + +The sysconfig network initialization system is capable of +handling multiple bonding devices. All that is necessary is for each +bonding instance to have an appropriately configured ifcfg-bondX file +(as described above). Do not specify the "max_bonds" parameter to any +instance of bonding, as this will confuse sysconfig. If you require +multiple bonding devices with identical parameters, create multiple +ifcfg-bondX files. + +Because the sysconfig scripts supply the bonding module +options in the ifcfg-bondX file, it is not necessary to add them to +the system ``/etc/modules.d/*.conf`` configuration files. + +3.2 Configuration with Initscripts Support +------------------------------------------ + +This section applies to distros using a recent version of +initscripts with bonding support, for example, Red Hat Enterprise Linux +version 3 or later, Fedora, etc. On these systems, the network +initialization scripts have knowledge of bonding, and can be configured to +control bonding devices. Note that older versions of the initscripts +package have lower levels of support for bonding; this will be noted where +applicable. + +These distros will not automatically load the network adapter +driver unless the ethX device is configured with an IP address. +Because of this constraint, users must manually configure a +network-script file for all physical adapters that will be members of +a bondX link. Network script files are located in the directory: + +/etc/sysconfig/network-scripts + +The file name must be prefixed with "ifcfg-eth" and suffixed +with the adapter's physical adapter number. For example, the script +for eth0 would be named /etc/sysconfig/network-scripts/ifcfg-eth0. +Place the following text in the file:: + + DEVICE=eth0 + USERCTL=no + ONBOOT=yes + MASTER=bond0 + SLAVE=yes + BOOTPROTO=none + +The DEVICE= line will be different for every ethX device and +must correspond with the name of the file, i.e., ifcfg-eth1 must have +a device line of DEVICE=eth1. The setting of the MASTER= line will +also depend on the final bonding interface name chosen for your bond. +As with other network devices, these typically start at 0, and go up +one for each device, i.e., the first bonding instance is bond0, the +second is bond1, and so on. + +Next, create a bond network script. The file name for this +script will be /etc/sysconfig/network-scripts/ifcfg-bondX where X is +the number of the bond. For bond0 the file is named "ifcfg-bond0", +for bond1 it is named "ifcfg-bond1", and so on. Within that file, +place the following text:: + + DEVICE=bond0 + IPADDR=192.168.1.1 + NETMASK=255.255.255.0 + NETWORK=192.168.1.0 + BROADCAST=192.168.1.255 + ONBOOT=yes + BOOTPROTO=none + USERCTL=no + +Be sure to change the networking specific lines (IPADDR, +NETMASK, NETWORK and BROADCAST) to match your network configuration. + +For later versions of initscripts, such as that found with Fedora +7 (or later) and Red Hat Enterprise Linux version 5 (or later), it is possible, +and, indeed, preferable, to specify the bonding options in the ifcfg-bond0 +file, e.g. a line of the format:: + + BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=192.168.1.254" + +will configure the bond with the specified options. The options +specified in BONDING_OPTS are identical to the bonding module parameters +except for the arp_ip_target field when using versions of initscripts older +than and 8.57 (Fedora 8) and 8.45.19 (Red Hat Enterprise Linux 5.2). When +using older versions each target should be included as a separate option and +should be preceded by a '+' to indicate it should be added to the list of +queried targets, e.g.,:: + + arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2 + +is the proper syntax to specify multiple targets. When specifying +options via BONDING_OPTS, it is not necessary to edit +``/etc/modprobe.d/*.conf``. + +For even older versions of initscripts that do not support +BONDING_OPTS, it is necessary to edit /etc/modprobe.d/*.conf, depending upon +your distro) to load the bonding module with your desired options when the +bond0 interface is brought up. The following lines in /etc/modprobe.d/*.conf +will load the bonding module, and select its options: + + alias bond0 bonding + options bond0 mode=balance-alb miimon=100 + +Replace the sample parameters with the appropriate set of +options for your configuration. + +Finally run "/etc/rc.d/init.d/network restart" as root. This +will restart the networking subsystem and your bond link should be now +up and running. + +3.2.1 Using DHCP with Initscripts +--------------------------------- + +Recent versions of initscripts (the versions supplied with Fedora +Core 3 and Red Hat Enterprise Linux 4, or later versions, are reported to +work) have support for assigning IP information to bonding devices via +DHCP. + +To configure bonding for DHCP, configure it as described +above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp" +and add a line consisting of "TYPE=Bonding". Note that the TYPE value +is case sensitive. + +3.2.2 Configuring Multiple Bonds with Initscripts +------------------------------------------------- + +Initscripts packages that are included with Fedora 7 and Red Hat +Enterprise Linux 5 support multiple bonding interfaces by simply +specifying the appropriate BONDING_OPTS= in ifcfg-bondX where X is the +number of the bond. This support requires sysfs support in the kernel, +and a bonding driver of version 3.0.0 or later. Other configurations may +not support this method for specifying multiple bonding interfaces; for +those instances, see the "Configuring Multiple Bonds Manually" section, +below. + +3.3 Configuring Bonding Manually with iproute2 +----------------------------------------------- + +This section applies to distros whose network initialization +scripts (the sysconfig or initscripts package) do not have specific +knowledge of bonding. One such distro is SuSE Linux Enterprise Server +version 8. + +The general method for these systems is to place the bonding +module parameters into a config file in /etc/modprobe.d/ (as +appropriate for the installed distro), then add modprobe and/or +`ip link` commands to the system's global init script. The name of +the global init script differs; for sysconfig, it is +/etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local. + +For example, if you wanted to make a simple bond of two e100 +devices (presumed to be eth0 and eth1), and have it persist across +reboots, edit the appropriate file (/etc/init.d/boot.local or +/etc/rc.d/rc.local), and add the following:: + + modprobe bonding mode=balance-alb miimon=100 + modprobe e100 + ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up + ip link set eth0 master bond0 + ip link set eth1 master bond0 + +Replace the example bonding module parameters and bond0 +network configuration (IP address, netmask, etc) with the appropriate +values for your configuration. + +Unfortunately, this method will not provide support for the +ifup and ifdown scripts on the bond devices. To reload the bonding +configuration, it is necessary to run the initialization script, e.g.,:: + + # /etc/init.d/boot.local + +or:: + + # /etc/rc.d/rc.local + +It may be desirable in such a case to create a separate script +which only initializes the bonding configuration, then call that +separate script from within boot.local. This allows for bonding to be +enabled without re-running the entire global init script. + +To shut down the bonding devices, it is necessary to first +mark the bonding device itself as being down, then remove the +appropriate device driver modules. For our example above, you can do +the following:: + + # ifconfig bond0 down + # rmmod bonding + # rmmod e100 + +Again, for convenience, it may be desirable to create a script +with these commands. + + +3.3.1 Configuring Multiple Bonds Manually +----------------------------------------- + +This section contains information on configuring multiple +bonding devices with differing options for those systems whose network +initialization scripts lack support for configuring multiple bonds. + +If you require multiple bonding devices, but all with the same +options, you may wish to use the "max_bonds" module parameter, +documented above. + +To create multiple bonding devices with differing options, it is +preferable to use bonding parameters exported by sysfs, documented in the +section below. + +For versions of bonding without sysfs support, the only means to +provide multiple instances of bonding with differing options is to load +the bonding driver multiple times. Note that current versions of the +sysconfig network initialization scripts handle this automatically; if +your distro uses these scripts, no special action is needed. See the +section Configuring Bonding Devices, above, if you're not sure about your +network initialization scripts. + +To load multiple instances of the module, it is necessary to +specify a different name for each instance (the module loading system +requires that every loaded module, even multiple instances of the same +module, have a unique name). This is accomplished by supplying multiple +sets of bonding options in ``/etc/modprobe.d/*.conf``, for example:: + + alias bond0 bonding + options bond0 -o bond0 mode=balance-rr miimon=100 + + alias bond1 bonding + options bond1 -o bond1 mode=balance-alb miimon=50 + +will load the bonding module two times. The first instance is +named "bond0" and creates the bond0 device in balance-rr mode with an +miimon of 100. The second instance is named "bond1" and creates the +bond1 device in balance-alb mode with an miimon of 50. + +In some circumstances (typically with older distributions), +the above does not work, and the second bonding instance never sees +its options. In that case, the second options line can be substituted +as follows:: + + install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ + mode=balance-alb miimon=50 + +This may be repeated any number of times, specifying a new and +unique name in place of bond1 for each subsequent instance. + +It has been observed that some Red Hat supplied kernels are unable +to rename modules at load time (the "-o bond1" part). Attempts to pass +that option to modprobe will produce an "Operation not permitted" error. +This has been reported on some Fedora Core kernels, and has been seen on +RHEL 4 as well. On kernels exhibiting this problem, it will be impossible +to configure multiple bonds with differing parameters (as they are older +kernels, and also lack sysfs support). + +3.4 Configuring Bonding Manually via Sysfs +------------------------------------------ + +Starting with version 3.0.0, Channel Bonding may be configured +via the sysfs interface. This interface allows dynamic configuration +of all bonds in the system without unloading the module. It also +allows for adding and removing bonds at runtime. Ifenslave is no +longer required, though it is still supported. + +Use of the sysfs interface allows you to use multiple bonds +with different configurations without having to reload the module. +It also allows you to use multiple, differently configured bonds when +bonding is compiled into the kernel. + +You must have the sysfs filesystem mounted to configure +bonding this way. The examples in this document assume that you +are using the standard mount point for sysfs, e.g. /sys. If your +sysfs filesystem is mounted elsewhere, you will need to adjust the +example paths accordingly. + +Creating and Destroying Bonds +----------------------------- +To add a new bond foo:: + + # echo +foo > /sys/class/net/bonding_masters + +To remove an existing bond bar:: + + # echo -bar > /sys/class/net/bonding_masters + +To show all existing bonds:: + + # cat /sys/class/net/bonding_masters + +.. note:: + + due to 4K size limitation of sysfs files, this list may be + truncated if you have more than a few hundred bonds. This is unlikely + to occur under normal operating conditions. + +Adding and Removing Slaves +-------------------------- +Interfaces may be enslaved to a bond using the file +/sys/class/net//bonding/slaves. The semantics for this file +are the same as for the bonding_masters file. + +To enslave interface eth0 to bond bond0:: + + # ifconfig bond0 up + # echo +eth0 > /sys/class/net/bond0/bonding/slaves + +To free slave eth0 from bond bond0:: + + # echo -eth0 > /sys/class/net/bond0/bonding/slaves + +When an interface is enslaved to a bond, symlinks between the +two are created in the sysfs filesystem. In this case, you would get +/sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and +/sys/class/net/eth0/master pointing to /sys/class/net/bond0. + +This means that you can tell quickly whether or not an +interface is enslaved by looking for the master symlink. Thus: +# echo -eth0 > /sys/class/net/eth0/master/bonding/slaves +will free eth0 from whatever bond it is enslaved to, regardless of +the name of the bond interface. + +Changing a Bond's Configuration +------------------------------- +Each bond may be configured individually by manipulating the +files located in /sys/class/net//bonding + +The names of these files correspond directly with the command- +line parameters described elsewhere in this file, and, with the +exception of arp_ip_target, they accept the same values. To see the +current setting, simply cat the appropriate file. + +A few examples will be given here; for specific usage +guidelines for each parameter, see the appropriate section in this +document. + +To configure bond0 for balance-alb mode:: + + # ifconfig bond0 down + # echo 6 > /sys/class/net/bond0/bonding/mode + - or - + # echo balance-alb > /sys/class/net/bond0/bonding/mode + +.. note:: + + The bond interface must be down before the mode can be changed. + +To enable MII monitoring on bond0 with a 1 second interval:: + + # echo 1000 > /sys/class/net/bond0/bonding/miimon + +.. note:: + + If ARP monitoring is enabled, it will disabled when MII + monitoring is enabled, and vice-versa. + +To add ARP targets:: + + # echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target + # echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target + +.. note:: + + up to 16 target addresses may be specified. + +To remove an ARP target:: + + # echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target + +To configure the interval between learning packet transmits:: + + # echo 12 > /sys/class/net/bond0/bonding/lp_interval + +.. note:: + + the lp_interval is the number of seconds between instances where + the bonding driver sends learning packets to each slaves peer switch. The + default interval is 1 second. + +Example Configuration +--------------------- +We begin with the same example that is shown in section 3.3, +executed with sysfs, and without using ifenslave. + +To make a simple bond of two e100 devices (presumed to be eth0 +and eth1), and have it persist across reboots, edit the appropriate +file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the +following:: + + modprobe bonding + modprobe e100 + echo balance-alb > /sys/class/net/bond0/bonding/mode + ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up + echo 100 > /sys/class/net/bond0/bonding/miimon + echo +eth0 > /sys/class/net/bond0/bonding/slaves + echo +eth1 > /sys/class/net/bond0/bonding/slaves + +To add a second bond, with two e1000 interfaces in +active-backup mode, using ARP monitoring, add the following lines to +your init script:: + + modprobe e1000 + echo +bond1 > /sys/class/net/bonding_masters + echo active-backup > /sys/class/net/bond1/bonding/mode + ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up + echo +192.168.2.100 /sys/class/net/bond1/bonding/arp_ip_target + echo 2000 > /sys/class/net/bond1/bonding/arp_interval + echo +eth2 > /sys/class/net/bond1/bonding/slaves + echo +eth3 > /sys/class/net/bond1/bonding/slaves + +3.5 Configuration with Interfaces Support +----------------------------------------- + +This section applies to distros which use /etc/network/interfaces file +to describe network interface configuration, most notably Debian and it's +derivatives. + +The ifup and ifdown commands on Debian don't support bonding out of +the box. The ifenslave-2.6 package should be installed to provide bonding +support. Once installed, this package will provide ``bond-*`` options +to be used into /etc/network/interfaces. + +Note that ifenslave-2.6 package will load the bonding module and use +the ifenslave command when appropriate. + +Example Configurations +---------------------- + +In /etc/network/interfaces, the following stanza will configure bond0, in +active-backup mode, with eth0 and eth1 as slaves:: + + auto bond0 + iface bond0 inet dhcp + bond-slaves eth0 eth1 + bond-mode active-backup + bond-miimon 100 + bond-primary eth0 eth1 + +If the above configuration doesn't work, you might have a system using +upstart for system startup. This is most notably true for recent +Ubuntu versions. The following stanza in /etc/network/interfaces will +produce the same result on those systems:: + + auto bond0 + iface bond0 inet dhcp + bond-slaves none + bond-mode active-backup + bond-miimon 100 + + auto eth0 + iface eth0 inet manual + bond-master bond0 + bond-primary eth0 eth1 + + auto eth1 + iface eth1 inet manual + bond-master bond0 + bond-primary eth0 eth1 + +For a full list of ``bond-*`` supported options in /etc/network/interfaces and +some more advanced examples tailored to you particular distros, see the files in +/usr/share/doc/ifenslave-2.6. + +3.6 Overriding Configuration for Special Cases +---------------------------------------------- + +When using the bonding driver, the physical port which transmits a frame is +typically selected by the bonding driver, and is not relevant to the user or +system administrator. The output port is simply selected using the policies of +the selected bonding mode. On occasion however, it is helpful to direct certain +classes of traffic to certain physical interfaces on output to implement +slightly more complex policies. For example, to reach a web server over a +bonded interface in which eth0 connects to a private network, while eth1 +connects via a public network, it may be desirous to bias the bond to send said +traffic over eth0 first, using eth1 only as a fall back, while all other traffic +can safely be sent over either interface. Such configurations may be achieved +using the traffic control utilities inherent in linux. + +By default the bonding driver is multiqueue aware and 16 queues are created +when the driver initializes (see Documentation/networking/multiqueue.txt +for details). If more or less queues are desired the module parameter +tx_queues can be used to change this value. There is no sysfs parameter +available as the allocation is done at module init time. + +The output of the file /proc/net/bonding/bondX has changed so the output Queue +ID is now printed for each slave:: + + Bonding Mode: fault-tolerance (active-backup) + Primary Slave: None + Currently Active Slave: eth0 + MII Status: up + MII Polling Interval (ms): 0 + Up Delay (ms): 0 + Down Delay (ms): 0 + + Slave Interface: eth0 + MII Status: up + Link Failure Count: 0 + Permanent HW addr: 00:1a:a0:12:8f:cb + Slave queue ID: 0 + + Slave Interface: eth1 + MII Status: up + Link Failure Count: 0 + Permanent HW addr: 00:1a:a0:12:8f:cc + Slave queue ID: 2 + +The queue_id for a slave can be set using the command:: + + # echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id + +Any interface that needs a queue_id set should set it with multiple calls +like the one above until proper priorities are set for all interfaces. On +distributions that allow configuration via initscripts, multiple 'queue_id' +arguments can be added to BONDING_OPTS to set all needed slave queues. + +These queue id's can be used in conjunction with the tc utility to configure +a multiqueue qdisc and filters to bias certain traffic to transmit on certain +slave devices. For instance, say we wanted, in the above configuration to +force all traffic bound to 192.168.1.100 to use eth1 in the bond as its output +device. The following commands would accomplish this:: + + # tc qdisc add dev bond0 handle 1 root multiq + + # tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip \ + dst 192.168.1.100 action skbedit queue_mapping 2 + +These commands tell the kernel to attach a multiqueue queue discipline to the +bond0 interface and filter traffic enqueued to it, such that packets with a dst +ip of 192.168.1.100 have their output queue mapping value overwritten to 2. +This value is then passed into the driver, causing the normal output path +selection policy to be overridden, selecting instead qid 2, which maps to eth1. + +Note that qid values begin at 1. Qid 0 is reserved to initiate to the driver +that normal output policy selection should take place. One benefit to simply +leaving the qid for a slave to 0 is the multiqueue awareness in the bonding +driver that is now present. This awareness allows tc filters to be placed on +slave devices as well as bond devices and the bonding driver will simply act as +a pass-through for selecting output queues on the slave device rather than +output port selection. + +This feature first appeared in bonding driver version 3.7.0 and support for +output slave selection was limited to round-robin and active-backup modes. + +3.7 Configuring LACP for 802.3ad mode in a more secure way +---------------------------------------------------------- + +When using 802.3ad bonding mode, the Actor (host) and Partner (switch) +exchange LACPDUs. These LACPDUs cannot be sniffed, because they are +destined to link local mac addresses (which switches/bridges are not +supposed to forward). However, most of the values are easily predictable +or are simply the machine's MAC address (which is trivially known to all +other hosts in the same L2). This implies that other machines in the L2 +domain can spoof LACPDU packets from other hosts to the switch and potentially +cause mayhem by joining (from the point of view of the switch) another +machine's aggregate, thus receiving a portion of that hosts incoming +traffic and / or spoofing traffic from that machine themselves (potentially +even successfully terminating some portion of flows). Though this is not +a likely scenario, one could avoid this possibility by simply configuring +few bonding parameters: + + (a) ad_actor_system : You can set a random mac-address that can be used for + these LACPDU exchanges. The value can not be either NULL or Multicast. + Also it's preferable to set the local-admin bit. Following shell code + generates a random mac-address as described above:: + + # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \ + $(( (RANDOM & 0xFE) | 0x02 )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF ))) + # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system + + (b) ad_actor_sys_prio : Randomize the system priority. The default value + is 65535, but system can take the value from 1 - 65535. Following shell + code generates random priority and sets it:: + + # sys_prio=$(( 1 + RANDOM + RANDOM )) + # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio + + (c) ad_user_port_key : Use the user portion of the port-key. The default + keeps this empty. These are the upper 10 bits of the port-key and value + ranges from 0 - 1023. Following shell code generates these 10 bits and + sets it:: + + # usr_port_key=$(( RANDOM & 0x3FF )) + # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key + + +4 Querying Bonding Configuration +================================= + +4.1 Bonding Configuration +------------------------- + +Each bonding device has a read-only file residing in the +/proc/net/bonding directory. The file contents include information +about the bonding configuration, options and state of each slave. + +For example, the contents of /proc/net/bonding/bond0 after the +driver is loaded with parameters of mode=0 and miimon=1000 is +generally as follows:: + + Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004) + Bonding Mode: load balancing (round-robin) + Currently Active Slave: eth0 + MII Status: up + MII Polling Interval (ms): 1000 + Up Delay (ms): 0 + Down Delay (ms): 0 + + Slave Interface: eth1 + MII Status: up + Link Failure Count: 1 + + Slave Interface: eth0 + MII Status: up + Link Failure Count: 1 + +The precise format and contents will change depending upon the +bonding configuration, state, and version of the bonding driver. + +4.2 Network configuration +------------------------- + +The network configuration can be inspected using the ifconfig +command. Bonding devices will have the MASTER flag set; Bonding slave +devices will have the SLAVE flag set. The ifconfig output does not +contain information on which slaves are associated with which masters. + +In the example below, the bond0 interface is the master +(MASTER) while eth0 and eth1 are slaves (SLAVE). Notice all slaves of +bond0 have the same MAC address (HWaddr) as bond0 for all modes except +TLB and ALB that require a unique MAC address for each slave:: + + # /sbin/ifconfig + bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 + inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 + UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 + RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0 + TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0 + collisions:0 txqueuelen:0 + + eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 + UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 + RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0 + TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0 + collisions:0 txqueuelen:100 + Interrupt:10 Base address:0x1080 + + eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 + UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 + RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0 + TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0 + collisions:0 txqueuelen:100 + Interrupt:9 Base address:0x1400 + +5. Switch Configuration +======================= + +For this section, "switch" refers to whatever system the +bonded devices are directly connected to (i.e., where the other end of +the cable plugs into). This may be an actual dedicated switch device, +or it may be another regular system (e.g., another computer running +Linux), + +The active-backup, balance-tlb and balance-alb modes do not +require any specific configuration of the switch. + +The 802.3ad mode requires that the switch have the appropriate +ports configured as an 802.3ad aggregation. The precise method used +to configure this varies from switch to switch, but, for example, a +Cisco 3550 series switch requires that the appropriate ports first be +grouped together in a single etherchannel instance, then that +etherchannel is set to mode "lacp" to enable 802.3ad (instead of +standard EtherChannel). + +The balance-rr, balance-xor and broadcast modes generally +require that the switch have the appropriate ports grouped together. +The nomenclature for such a group differs between switches, it may be +called an "etherchannel" (as in the Cisco example, above), a "trunk +group" or some other similar variation. For these modes, each switch +will also have its own configuration options for the switch's transmit +policy to the bond. Typical choices include XOR of either the MAC or +IP addresses. The transmit policy of the two peers does not need to +match. For these three modes, the bonding mode really selects a +transmit policy for an EtherChannel group; all three will interoperate +with another EtherChannel group. + + +6. 802.1q VLAN Support +====================== + +It is possible to configure VLAN devices over a bond interface +using the 8021q driver. However, only packets coming from the 8021q +driver and passing through bonding will be tagged by default. Self +generated packets, for example, bonding's learning packets or ARP +packets generated by either ALB mode or the ARP monitor mechanism, are +tagged internally by bonding itself. As a result, bonding must +"learn" the VLAN IDs configured above it, and use those IDs to tag +self generated packets. + +For reasons of simplicity, and to support the use of adapters +that can do VLAN hardware acceleration offloading, the bonding +interface declares itself as fully hardware offloading capable, it gets +the add_vid/kill_vid notifications to gather the necessary +information, and it propagates those actions to the slaves. In case +of mixed adapter types, hardware accelerated tagged packets that +should go through an adapter that is not offloading capable are +"un-accelerated" by the bonding driver so the VLAN tag sits in the +regular location. + +VLAN interfaces *must* be added on top of a bonding interface +only after enslaving at least one slave. The bonding interface has a +hardware address of 00:00:00:00:00:00 until the first slave is added. +If the VLAN interface is created prior to the first enslavement, it +would pick up the all-zeroes hardware address. Once the first slave +is attached to the bond, the bond device itself will pick up the +slave's hardware address, which is then available for the VLAN device. + +Also, be aware that a similar problem can occur if all slaves +are released from a bond that still has one or more VLAN interfaces on +top of it. When a new slave is added, the bonding interface will +obtain its hardware address from the first slave, which might not +match the hardware address of the VLAN interfaces (which was +ultimately copied from an earlier slave). + +There are two methods to insure that the VLAN device operates +with the correct hardware address if all slaves are removed from a +bond interface: + +1. Remove all VLAN interfaces then recreate them + +2. Set the bonding interface's hardware address so that it +matches the hardware address of the VLAN interfaces. + +Note that changing a VLAN interface's HW address would set the +underlying device -- i.e. the bonding interface -- to promiscuous +mode, which might not be what you want. + + +7. Link Monitoring +================== + +The bonding driver at present supports two schemes for +monitoring a slave device's link state: the ARP monitor and the MII +monitor. + +At the present time, due to implementation restrictions in the +bonding driver itself, it is not possible to enable both ARP and MII +monitoring simultaneously. + +7.1 ARP Monitor Operation +------------------------- + +The ARP monitor operates as its name suggests: it sends ARP +queries to one or more designated peer systems on the network, and +uses the response as an indication that the link is operating. This +gives some assurance that traffic is actually flowing to and from one +or more peers on the local network. + +The ARP monitor relies on the device driver itself to verify +that traffic is flowing. In particular, the driver must keep up to +date the last receive time, dev->last_rx. Drivers that use NETIF_F_LLTX +flag must also update netdev_queue->trans_start. If they do not, then the +ARP monitor will immediately fail any slaves using that driver, and +those slaves will stay down. If networking monitoring (tcpdump, etc) +shows the ARP requests and replies on the network, then it may be that +your device driver is not updating last_rx and trans_start. + +7.2 Configuring Multiple ARP Targets +------------------------------------ + +While ARP monitoring can be done with just one target, it can +be useful in a High Availability setup to have several targets to +monitor. In the case of just one target, the target itself may go +down or have a problem making it unresponsive to ARP requests. Having +an additional target (or several) increases the reliability of the ARP +monitoring. + +Multiple ARP targets must be separated by commas as follows:: + + # example options for ARP monitoring with three targets + alias bond0 bonding + options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9 + +For just a single target the options would resemble:: + + # example options for ARP monitoring with one target + alias bond0 bonding + options bond0 arp_interval=60 arp_ip_target=192.168.0.100 + + +7.3 MII Monitor Operation +------------------------- + +The MII monitor monitors only the carrier state of the local +network interface. It accomplishes this in one of three ways: by +depending upon the device driver to maintain its carrier state, by +querying the device's MII registers, or by making an ethtool query to +the device. + +If the use_carrier module parameter is 1 (the default value), +then the MII monitor will rely on the driver for carrier state +information (via the netif_carrier subsystem). As explained in the +use_carrier parameter information, above, if the MII monitor fails to +detect carrier loss on the device (e.g., when the cable is physically +disconnected), it may be that the driver does not support +netif_carrier. + +If use_carrier is 0, then the MII monitor will first query the +device's (via ioctl) MII registers and check the link state. If that +request fails (not just that it returns carrier down), then the MII +monitor will make an ethtool ETHOOL_GLINK request to attempt to obtain +the same information. If both methods fail (i.e., the driver either +does not support or had some error in processing both the MII register +and ethtool requests), then the MII monitor will assume the link is +up. + +8. Potential Sources of Trouble +=============================== + +8.1 Adventures in Routing +------------------------- + +When bonding is configured, it is important that the slave +devices not have routes that supersede routes of the master (or, +generally, not have routes at all). For example, suppose the bonding +device bond0 has two slaves, eth0 and eth1, and the routing table is +as follows:: + + Kernel IP routing table + Destination Gateway Genmask Flags MSS Window irtt Iface + 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0 + 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1 + 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0 + 127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo + +This routing configuration will likely still update the +receive/transmit times in the driver (needed by the ARP monitor), but +may bypass the bonding driver (because outgoing traffic to, in this +case, another host on network 10 would use eth0 or eth1 before bond0). + +The ARP monitor (and ARP itself) may become confused by this +configuration, because ARP requests (generated by the ARP monitor) +will be sent on one interface (bond0), but the corresponding reply +will arrive on a different interface (eth0). This reply looks to ARP +as an unsolicited ARP reply (because ARP matches replies on an +interface basis), and is discarded. The MII monitor is not affected +by the state of the routing table. + +The solution here is simply to insure that slaves do not have +routes of their own, and if for some reason they must, those routes do +not supersede routes of their master. This should generally be the +case, but unusual configurations or errant manual or automatic static +route additions may cause trouble. + +8.2 Ethernet Device Renaming +---------------------------- + +On systems with network configuration scripts that do not +associate physical devices directly with network interface names (so +that the same physical device always has the same "ethX" name), it may +be necessary to add some special logic to config files in +/etc/modprobe.d/. + +For example, given a modules.conf containing the following:: + + alias bond0 bonding + options bond0 mode=some-mode miimon=50 + alias eth0 tg3 + alias eth1 tg3 + alias eth2 e1000 + alias eth3 e1000 + +If neither eth0 and eth1 are slaves to bond0, then when the +bond0 interface comes up, the devices may end up reordered. This +happens because bonding is loaded first, then its slave device's +drivers are loaded next. Since no other drivers have been loaded, +when the e1000 driver loads, it will receive eth0 and eth1 for its +devices, but the bonding configuration tries to enslave eth2 and eth3 +(which may later be assigned to the tg3 devices). + +Adding the following:: + + add above bonding e1000 tg3 + +causes modprobe to load e1000 then tg3, in that order, when +bonding is loaded. This command is fully documented in the +modules.conf manual page. + +On systems utilizing modprobe an equivalent problem can occur. +In this case, the following can be added to config files in +/etc/modprobe.d/ as:: + + softdep bonding pre: tg3 e1000 + +This will load tg3 and e1000 modules before loading the bonding one. +Full documentation on this can be found in the modprobe.d and modprobe +manual pages. + +8.3. Painfully Slow Or No Failed Link Detection By Miimon +--------------------------------------------------------- + +By default, bonding enables the use_carrier option, which +instructs bonding to trust the driver to maintain carrier state. + +As discussed in the options section, above, some drivers do +not support the netif_carrier_on/_off link state tracking system. +With use_carrier enabled, bonding will always see these links as up, +regardless of their actual state. + +Additionally, other drivers do support netif_carrier, but do +not maintain it in real time, e.g., only polling the link state at +some fixed interval. In this case, miimon will detect failures, but +only after some long period of time has expired. If it appears that +miimon is very slow in detecting link failures, try specifying +use_carrier=0 to see if that improves the failure detection time. If +it does, then it may be that the driver checks the carrier state at a +fixed interval, but does not cache the MII register values (so the +use_carrier=0 method of querying the registers directly works). If +use_carrier=0 does not improve the failover, then the driver may cache +the registers, or the problem may be elsewhere. + +Also, remember that miimon only checks for the device's +carrier state. It has no way to determine the state of devices on or +beyond other ports of a switch, or if a switch is refusing to pass +traffic while still maintaining carrier on. + +9. SNMP agents +=============== + +If running SNMP agents, the bonding driver should be loaded +before any network drivers participating in a bond. This requirement +is due to the interface index (ipAdEntIfIndex) being associated to +the first interface found with a given IP address. That is, there is +only one ipAdEntIfIndex for each IP address. For example, if eth0 and +eth1 are slaves of bond0 and the driver for eth0 is loaded before the +bonding driver, the interface for the IP address will be associated +with the eth0 interface. This configuration is shown below, the IP +address 192.168.1.1 has an interface index of 2 which indexes to eth0 +in the ifDescr table (ifDescr.2). + +:: + + interfaces.ifTable.ifEntry.ifDescr.1 = lo + interfaces.ifTable.ifEntry.ifDescr.2 = eth0 + interfaces.ifTable.ifEntry.ifDescr.3 = eth1 + interfaces.ifTable.ifEntry.ifDescr.4 = eth2 + interfaces.ifTable.ifEntry.ifDescr.5 = eth3 + interfaces.ifTable.ifEntry.ifDescr.6 = bond0 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 + +This problem is avoided by loading the bonding driver before +any network drivers participating in a bond. Below is an example of +loading the bonding driver first, the IP address 192.168.1.1 is +correctly associated with ifDescr.2. + + interfaces.ifTable.ifEntry.ifDescr.1 = lo + interfaces.ifTable.ifEntry.ifDescr.2 = bond0 + interfaces.ifTable.ifEntry.ifDescr.3 = eth0 + interfaces.ifTable.ifEntry.ifDescr.4 = eth1 + interfaces.ifTable.ifEntry.ifDescr.5 = eth2 + interfaces.ifTable.ifEntry.ifDescr.6 = eth3 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 + +While some distributions may not report the interface name in +ifDescr, the association between the IP address and IfIndex remains +and SNMP functions such as Interface_Scan_Next will report that +association. + +10. Promiscuous mode +==================== + +When running network monitoring tools, e.g., tcpdump, it is +common to enable promiscuous mode on the device, so that all traffic +is seen (instead of seeing only traffic destined for the local host). +The bonding driver handles promiscuous mode changes to the bonding +master device (e.g., bond0), and propagates the setting to the slave +devices. + +For the balance-rr, balance-xor, broadcast, and 802.3ad modes, +the promiscuous mode setting is propagated to all slaves. + +For the active-backup, balance-tlb and balance-alb modes, the +promiscuous mode setting is propagated only to the active slave. + +For balance-tlb mode, the active slave is the slave currently +receiving inbound traffic. + +For balance-alb mode, the active slave is the slave used as a +"primary." This slave is used for mode-specific control traffic, for +sending to peers that are unassigned or if the load is unbalanced. + +For the active-backup, balance-tlb and balance-alb modes, when +the active slave changes (e.g., due to a link failure), the +promiscuous setting will be propagated to the new active slave. + +11. Configuring Bonding for High Availability +============================================= + +High Availability refers to configurations that provide +maximum network availability by having redundant or backup devices, +links or switches between the host and the rest of the world. The +goal is to provide the maximum availability of network connectivity +(i.e., the network always works), even though other configurations +could provide higher throughput. + +11.1 High Availability in a Single Switch Topology +-------------------------------------------------- + +If two hosts (or a host and a single switch) are directly +connected via multiple physical links, then there is no availability +penalty to optimizing for maximum bandwidth. In this case, there is +only one switch (or peer), so if it fails, there is no alternative +access to fail over to. Additionally, the bonding load balance modes +support link monitoring of their members, so if individual links fail, +the load will be rebalanced across the remaining devices. + +See Section 12, "Configuring Bonding for Maximum Throughput" +for information on configuring bonding with one peer device. + +11.2 High Availability in a Multiple Switch Topology +---------------------------------------------------- + +With multiple switches, the configuration of bonding and the +network changes dramatically. In multiple switch topologies, there is +a trade off between network availability and usable bandwidth. + +Below is a sample network, configured to maximize the +availability of the network:: + + | | + |port3 port3| + +-----+----+ +-----+----+ + | |port2 ISL port2| | + | switch A +--------------------------+ switch B | + | | | | + +-----+----+ +-----++---+ + |port1 port1| + | +-------+ | + +-------------+ host1 +---------------+ + eth0 +-------+ eth1 + +In this configuration, there is a link between the two +switches (ISL, or inter switch link), and multiple ports connecting to +the outside world ("port3" on each switch). There is no technical +reason that this could not be extended to a third switch. + +11.2.1 HA Bonding Mode Selection for Multiple Switch Topology +------------------------------------------------------------- + +In a topology such as the example above, the active-backup and +broadcast modes are the only useful bonding modes when optimizing for +availability; the other modes require all links to terminate on the +same peer for them to behave rationally. + +active-backup: + This is generally the preferred mode, particularly if + the switches have an ISL and play together well. If the + network configuration is such that one switch is specifically + a backup switch (e.g., has lower capacity, higher cost, etc), + then the primary option can be used to insure that the + preferred link is always used when it is available. + +broadcast: + This mode is really a special purpose mode, and is suitable + only for very specific needs. For example, if the two + switches are not connected (no ISL), and the networks beyond + them are totally independent. In this case, if it is + necessary for some specific one-way traffic to reach both + independent networks, then the broadcast mode may be suitable. + +11.2.2 HA Link Monitoring Selection for Multiple Switch Topology +---------------------------------------------------------------- + +The choice of link monitoring ultimately depends upon your +switch. If the switch can reliably fail ports in response to other +failures, then either the MII or ARP monitors should work. For +example, in the above example, if the "port3" link fails at the remote +end, the MII monitor has no direct means to detect this. The ARP +monitor could be configured with a target at the remote end of port3, +thus detecting that failure without switch support. + +In general, however, in a multiple switch topology, the ARP +monitor can provide a higher level of reliability in detecting end to +end connectivity failures (which may be caused by the failure of any +individual component to pass traffic for any reason). Additionally, +the ARP monitor should be configured with multiple targets (at least +one for each switch in the network). This will insure that, +regardless of which switch is active, the ARP monitor has a suitable +target to query. + +Note, also, that of late many switches now support a functionality +generally referred to as "trunk failover." This is a feature of the +switch that causes the link state of a particular switch port to be set +down (or up) when the state of another switch port goes down (or up). +Its purpose is to propagate link failures from logically "exterior" ports +to the logically "interior" ports that bonding is able to monitor via +miimon. Availability and configuration for trunk failover varies by +switch, but this can be a viable alternative to the ARP monitor when using +suitable switches. + +12. Configuring Bonding for Maximum Throughput +============================================== + +12.1 Maximizing Throughput in a Single Switch Topology +------------------------------------------------------ + +In a single switch configuration, the best method to maximize +throughput depends upon the application and network environment. The +various load balancing modes each have strengths and weaknesses in +different environments, as detailed below. + +For this discussion, we will break down the topologies into +two categories. Depending upon the destination of most traffic, we +categorize them into either "gatewayed" or "local" configurations. + +In a gatewayed configuration, the "switch" is acting primarily +as a router, and the majority of traffic passes through this router to +other networks. An example would be the following:: + + + +----------+ +----------+ + | |eth0 port1| | to other networks + | Host A +---------------------+ router +-------------------> + | +---------------------+ | Hosts B and C are out + | |eth1 port2| | here somewhere + +----------+ +----------+ + +The router may be a dedicated router device, or another host +acting as a gateway. For our discussion, the important point is that +the majority of traffic from Host A will pass through the router to +some other network before reaching its final destination. + +In a gatewayed network configuration, although Host A may +communicate with many other systems, all of its traffic will be sent +and received via one other peer on the local network, the router. + +Note that the case of two systems connected directly via +multiple physical links is, for purposes of configuring bonding, the +same as a gatewayed configuration. In that case, it happens that all +traffic is destined for the "gateway" itself, not some other network +beyond the gateway. + +In a local configuration, the "switch" is acting primarily as +a switch, and the majority of traffic passes through this switch to +reach other stations on the same network. An example would be the +following:: + + +----------+ +----------+ +--------+ + | |eth0 port1| +-------+ Host B | + | Host A +------------+ switch |port3 +--------+ + | +------------+ | +--------+ + | |eth1 port2| +------------------+ Host C | + +----------+ +----------+port4 +--------+ + + +Again, the switch may be a dedicated switch device, or another +host acting as a gateway. For our discussion, the important point is +that the majority of traffic from Host A is destined for other hosts +on the same local network (Hosts B and C in the above example). + +In summary, in a gatewayed configuration, traffic to and from +the bonded device will be to the same MAC level peer on the network +(the gateway itself, i.e., the router), regardless of its final +destination. In a local configuration, traffic flows directly to and +from the final destinations, thus, each destination (Host B, Host C) +will be addressed directly by their individual MAC addresses. + +This distinction between a gatewayed and a local network +configuration is important because many of the load balancing modes +available use the MAC addresses of the local network source and +destination to make load balancing decisions. The behavior of each +mode is described below. + + +12.1.1 MT Bonding Mode Selection for Single Switch Topology +----------------------------------------------------------- + +This configuration is the easiest to set up and to understand, +although you will have to decide which bonding mode best suits your +needs. The trade offs for each mode are detailed below: + +balance-rr: + This mode is the only mode that will permit a single + TCP/IP connection to stripe traffic across multiple + interfaces. It is therefore the only mode that will allow a + single TCP/IP stream to utilize more than one interface's + worth of throughput. This comes at a cost, however: the + striping generally results in peer systems receiving packets out + of order, causing TCP/IP's congestion control system to kick + in, often by retransmitting segments. + + It is possible to adjust TCP/IP's congestion limits by + altering the net.ipv4.tcp_reordering sysctl parameter. The + usual default value is 3. But keep in mind TCP stack is able + to automatically increase this when it detects reorders. + + Note that the fraction of packets that will be delivered out of + order is highly variable, and is unlikely to be zero. The level + of reordering depends upon a variety of factors, including the + networking interfaces, the switch, and the topology of the + configuration. Speaking in general terms, higher speed network + cards produce more reordering (due to factors such as packet + coalescing), and a "many to many" topology will reorder at a + higher rate than a "many slow to one fast" configuration. + + Many switches do not support any modes that stripe traffic + (instead choosing a port based upon IP or MAC level addresses); + for those devices, traffic for a particular connection flowing + through the switch to a balance-rr bond will not utilize greater + than one interface's worth of bandwidth. + + If you are utilizing protocols other than TCP/IP, UDP for + example, and your application can tolerate out of order + delivery, then this mode can allow for single stream datagram + performance that scales near linearly as interfaces are added + to the bond. + + This mode requires the switch to have the appropriate ports + configured for "etherchannel" or "trunking." + +active-backup: + There is not much advantage in this network topology to + the active-backup mode, as the inactive backup devices are all + connected to the same peer as the primary. In this case, a + load balancing mode (with link monitoring) will provide the + same level of network availability, but with increased + available bandwidth. On the plus side, active-backup mode + does not require any configuration of the switch, so it may + have value if the hardware available does not support any of + the load balance modes. + +balance-xor: + This mode will limit traffic such that packets destined + for specific peers will always be sent over the same + interface. Since the destination is determined by the MAC + addresses involved, this mode works best in a "local" network + configuration (as described above), with destinations all on + the same local network. This mode is likely to be suboptimal + if all your traffic is passed through a single router (i.e., a + "gatewayed" network configuration, as described above). + + As with balance-rr, the switch ports need to be configured for + "etherchannel" or "trunking." + +broadcast: + Like active-backup, there is not much advantage to this + mode in this type of network topology. + +802.3ad: + This mode can be a good choice for this type of network + topology. The 802.3ad mode is an IEEE standard, so all peers + that implement 802.3ad should interoperate well. The 802.3ad + protocol includes automatic configuration of the aggregates, + so minimal manual configuration of the switch is needed + (typically only to designate that some set of devices is + available for 802.3ad). The 802.3ad standard also mandates + that frames be delivered in order (within certain limits), so + in general single connections will not see misordering of + packets. The 802.3ad mode does have some drawbacks: the + standard mandates that all devices in the aggregate operate at + the same speed and duplex. Also, as with all bonding load + balance modes other than balance-rr, no single connection will + be able to utilize more than a single interface's worth of + bandwidth. + + Additionally, the linux bonding 802.3ad implementation + distributes traffic by peer (using an XOR of MAC addresses + and packet type ID), so in a "gatewayed" configuration, all + outgoing traffic will generally use the same device. Incoming + traffic may also end up on a single device, but that is + dependent upon the balancing policy of the peer's 802.3ad + implementation. In a "local" configuration, traffic will be + distributed across the devices in the bond. + + Finally, the 802.3ad mode mandates the use of the MII monitor, + therefore, the ARP monitor is not available in this mode. + +balance-tlb: + The balance-tlb mode balances outgoing traffic by peer. + Since the balancing is done according to MAC address, in a + "gatewayed" configuration (as described above), this mode will + send all traffic across a single device. However, in a + "local" network configuration, this mode balances multiple + local network peers across devices in a vaguely intelligent + manner (not a simple XOR as in balance-xor or 802.3ad mode), + so that mathematically unlucky MAC addresses (i.e., ones that + XOR to the same value) will not all "bunch up" on a single + interface. + + Unlike 802.3ad, interfaces may be of differing speeds, and no + special switch configuration is required. On the down side, + in this mode all incoming traffic arrives over a single + interface, this mode requires certain ethtool support in the + network device driver of the slave interfaces, and the ARP + monitor is not available. + +balance-alb: + This mode is everything that balance-tlb is, and more. + It has all of the features (and restrictions) of balance-tlb, + and will also balance incoming traffic from local network + peers (as described in the Bonding Module Options section, + above). + + The only additional down side to this mode is that the network + device driver must support changing the hardware address while + the device is open. + +12.1.2 MT Link Monitoring for Single Switch Topology +---------------------------------------------------- + +The choice of link monitoring may largely depend upon which +mode you choose to use. The more advanced load balancing modes do not +support the use of the ARP monitor, and are thus restricted to using +the MII monitor (which does not provide as high a level of end to end +assurance as the ARP monitor). + +12.2 Maximum Throughput in a Multiple Switch Topology +----------------------------------------------------- + +Multiple switches may be utilized to optimize for throughput +when they are configured in parallel as part of an isolated network +between two or more systems, for example:: + + +-----------+ + | Host A | + +-+---+---+-+ + | | | + +--------+ | +---------+ + | | | + +------+---+ +-----+----+ +-----+----+ + | Switch A | | Switch B | | Switch C | + +------+---+ +-----+----+ +-----+----+ + | | | + +--------+ | +---------+ + | | | + +-+---+---+-+ + | Host B | + +-----------+ + +In this configuration, the switches are isolated from one +another. One reason to employ a topology such as this is for an +isolated network with many hosts (a cluster configured for high +performance, for example), using multiple smaller switches can be more +cost effective than a single larger switch, e.g., on a network with 24 +hosts, three 24 port switches can be significantly less expensive than +a single 72 port switch. + +If access beyond the network is required, an individual host +can be equipped with an additional network device connected to an +external network; this host then additionally acts as a gateway. + +12.2.1 MT Bonding Mode Selection for Multiple Switch Topology +------------------------------------------------------------- + +In actual practice, the bonding mode typically employed in +configurations of this type is balance-rr. Historically, in this +network configuration, the usual caveats about out of order packet +delivery are mitigated by the use of network adapters that do not do +any kind of packet coalescing (via the use of NAPI, or because the +device itself does not generate interrupts until some number of +packets has arrived). When employed in this fashion, the balance-rr +mode allows individual connections between two hosts to effectively +utilize greater than one interface's bandwidth. + +12.2.2 MT Link Monitoring for Multiple Switch Topology +------------------------------------------------------ + +Again, in actual practice, the MII monitor is most often used +in this configuration, as performance is given preference over +availability. The ARP monitor will function in this topology, but its +advantages over the MII monitor are mitigated by the volume of probes +needed as the number of systems involved grows (remember that each +host in the network is configured with bonding). + +13. Switch Behavior Issues +========================== + +13.1 Link Establishment and Failover Delays +------------------------------------------- + +Some switches exhibit undesirable behavior with regard to the +timing of link up and down reporting by the switch. + +First, when a link comes up, some switches may indicate that +the link is up (carrier available), but not pass traffic over the +interface for some period of time. This delay is typically due to +some type of autonegotiation or routing protocol, but may also occur +during switch initialization (e.g., during recovery after a switch +failure). If you find this to be a problem, specify an appropriate +value to the updelay bonding module option to delay the use of the +relevant interface(s). + +Second, some switches may "bounce" the link state one or more +times while a link is changing state. This occurs most commonly while +the switch is initializing. Again, an appropriate updelay value may +help. + +Note that when a bonding interface has no active links, the +driver will immediately reuse the first link that goes up, even if the +updelay parameter has been specified (the updelay is ignored in this +case). If there are slave interfaces waiting for the updelay timeout +to expire, the interface that first went into that state will be +immediately reused. This reduces down time of the network if the +value of updelay has been overestimated, and since this occurs only in +cases with no connectivity, there is no additional penalty for +ignoring the updelay. + +In addition to the concerns about switch timings, if your +switches take a long time to go into backup mode, it may be desirable +to not activate a backup interface immediately after a link goes down. +Failover may be delayed via the downdelay bonding module option. + +13.2 Duplicated Incoming Packets +-------------------------------- + +NOTE: Starting with version 3.0.2, the bonding driver has logic to +suppress duplicate packets, which should largely eliminate this problem. +The following description is kept for reference. + +It is not uncommon to observe a short burst of duplicated +traffic when the bonding device is first used, or after it has been +idle for some period of time. This is most easily observed by issuing +a "ping" to some other host on the network, and noticing that the +output from ping flags duplicates (typically one per slave). + +For example, on a bond in active-backup mode with five slaves +all connected to one switch, the output may appear as follows:: + + # ping -n 10.0.4.2 + PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data. + 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms + 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) + 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) + 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) + 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) + 64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms + 64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms + 64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms + +This is not due to an error in the bonding driver, rather, it +is a side effect of how many switches update their MAC forwarding +tables. Initially, the switch does not associate the MAC address in +the packet with a particular switch port, and so it may send the +traffic to all ports until its MAC forwarding table is updated. Since +the interfaces attached to the bond may occupy multiple ports on a +single switch, when the switch (temporarily) floods the traffic to all +ports, the bond device receives multiple copies of the same packet +(one per slave device). + +The duplicated packet behavior is switch dependent, some +switches exhibit this, and some do not. On switches that display this +behavior, it can be induced by clearing the MAC forwarding table (on +most Cisco switches, the privileged command "clear mac address-table +dynamic" will accomplish this). + +14. Hardware Specific Considerations +==================================== + +This section contains additional information for configuring +bonding on specific hardware platforms, or for interfacing bonding +with particular switches or other devices. + +14.1 IBM BladeCenter +-------------------- + +This applies to the JS20 and similar systems. + +On the JS20 blades, the bonding driver supports only +balance-rr, active-backup, balance-tlb and balance-alb modes. This is +largely due to the network topology inside the BladeCenter, detailed +below. + +JS20 network adapter information +-------------------------------- + +All JS20s come with two Broadcom Gigabit Ethernet ports +integrated on the planar (that's "motherboard" in IBM-speak). In the +BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to +I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2. +An add-on Broadcom daughter card can be installed on a JS20 to provide +two more Gigabit Ethernet ports. These ports, eth2 and eth3, are +wired to I/O Modules 3 and 4, respectively. + +Each I/O Module may contain either a switch or a passthrough +module (which allows ports to be directly connected to an external +switch). Some bonding modes require a specific BladeCenter internal +network topology in order to function; these are detailed below. + +Additional BladeCenter-specific networking information can be +found in two IBM Redbooks (www.ibm.com/redbooks): + +- "IBM eServer BladeCenter Networking Options" +- "IBM eServer BladeCenter Layer 2-7 Network Switching" + +BladeCenter networking configuration +------------------------------------ + +Because a BladeCenter can be configured in a very large number +of ways, this discussion will be confined to describing basic +configurations. + +Normally, Ethernet Switch Modules (ESMs) are used in I/O +modules 1 and 2. In this configuration, the eth0 and eth1 ports of a +JS20 will be connected to different internal switches (in the +respective I/O modules). + +A passthrough module (OPM or CPM, optical or copper, +passthrough module) connects the I/O module directly to an external +switch. By using PMs in I/O module #1 and #2, the eth0 and eth1 +interfaces of a JS20 can be redirected to the outside world and +connected to a common external switch. + +Depending upon the mix of ESMs and PMs, the network will +appear to bonding as either a single switch topology (all PMs) or as a +multiple switch topology (one or more ESMs, zero or more PMs). It is +also possible to connect ESMs together, resulting in a configuration +much like the example in "High Availability in a Multiple Switch +Topology," above. + +Requirements for specific modes +------------------------------- + +The balance-rr mode requires the use of passthrough modules +for devices in the bond, all connected to an common external switch. +That switch must be configured for "etherchannel" or "trunking" on the +appropriate ports, as is usual for balance-rr. + +The balance-alb and balance-tlb modes will function with +either switch modules or passthrough modules (or a mix). The only +specific requirement for these modes is that all network interfaces +must be able to reach all destinations for traffic sent over the +bonding device (i.e., the network must converge at some point outside +the BladeCenter). + +The active-backup mode has no additional requirements. + +Link monitoring issues +---------------------- + +When an Ethernet Switch Module is in place, only the ARP +monitor will reliably detect link loss to an external switch. This is +nothing unusual, but examination of the BladeCenter cabinet would +suggest that the "external" network ports are the ethernet ports for +the system, when it fact there is a switch between these "external" +ports and the devices on the JS20 system itself. The MII monitor is +only able to detect link failures between the ESM and the JS20 system. + +When a passthrough module is in place, the MII monitor does +detect failures to the "external" port, which is then directly +connected to the JS20 system. + +Other concerns +-------------- + +The Serial Over LAN (SoL) link is established over the primary +ethernet (eth0) only, therefore, any loss of link to eth0 will result +in losing your SoL connection. It will not fail over with other +network traffic, as the SoL system is beyond the control of the +bonding driver. + +It may be desirable to disable spanning tree on the switch +(either the internal Ethernet Switch Module, or an external switch) to +avoid fail-over delay issues when using bonding. + + +15. Frequently Asked Questions +============================== + +1. Is it SMP safe? +------------------- + +Yes. The old 2.0.xx channel bonding patch was not SMP safe. +The new driver was designed to be SMP safe from the start. + +2. What type of cards will work with it? +----------------------------------------- + +Any Ethernet type cards (you can even mix cards - a Intel +EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes, +devices need not be of the same speed. + +Starting with version 3.2.1, bonding also supports Infiniband +slaves in active-backup mode. + +3. How many bonding devices can I have? +---------------------------------------- + +There is no limit. + +4. How many slaves can a bonding device have? +---------------------------------------------- + +This is limited only by the number of network interfaces Linux +supports and/or the number of network cards you can place in your +system. + +5. What happens when a slave link dies? +---------------------------------------- + +If link monitoring is enabled, then the failing device will be +disabled. The active-backup mode will fail over to a backup link, and +other modes will ignore the failed link. The link will continue to be +monitored, and should it recover, it will rejoin the bond (in whatever +manner is appropriate for the mode). See the sections on High +Availability and the documentation for each mode for additional +information. + +Link monitoring can be enabled via either the miimon or +arp_interval parameters (described in the module parameters section, +above). In general, miimon monitors the carrier state as sensed by +the underlying network device, and the arp monitor (arp_interval) +monitors connectivity to another host on the local network. + +If no link monitoring is configured, the bonding driver will +be unable to detect link failures, and will assume that all links are +always available. This will likely result in lost packets, and a +resulting degradation of performance. The precise performance loss +depends upon the bonding mode and network configuration. + +6. Can bonding be used for High Availability? +---------------------------------------------- + +Yes. See the section on High Availability for details. + +7. Which switches/systems does it work with? +--------------------------------------------- + +The full answer to this depends upon the desired mode. + +In the basic balance modes (balance-rr and balance-xor), it +works with any system that supports etherchannel (also called +trunking). Most managed switches currently available have such +support, and many unmanaged switches as well. + +The advanced balance modes (balance-tlb and balance-alb) do +not have special switch requirements, but do need device drivers that +support specific features (described in the appropriate section under +module parameters, above). + +In 802.3ad mode, it works with systems that support IEEE +802.3ad Dynamic Link Aggregation. Most managed and many unmanaged +switches currently available support 802.3ad. + +The active-backup mode should work with any Layer-II switch. + +8. Where does a bonding device get its MAC address from? +--------------------------------------------------------- + +When using slave devices that have fixed MAC addresses, or when +the fail_over_mac option is enabled, the bonding device's MAC address is +the MAC address of the active slave. + +For other configurations, if not explicitly configured (with +ifconfig or ip link), the MAC address of the bonding device is taken from +its first slave device. This MAC address is then passed to all following +slaves and remains persistent (even if the first slave is removed) until +the bonding device is brought down or reconfigured. + +If you wish to change the MAC address, you can set it with +ifconfig or ip link:: + + # ifconfig bond0 hw ether 00:11:22:33:44:55 + + # ip link set bond0 address 66:77:88:99:aa:bb + +The MAC address can be also changed by bringing down/up the +device and then changing its slaves (or their order):: + + # ifconfig bond0 down ; modprobe -r bonding + # ifconfig bond0 .... up + # ifenslave bond0 eth... + +This method will automatically take the address from the next +slave that is added. + +To restore your slaves' MAC addresses, you need to detach them +from the bond (``ifenslave -d bond0 eth0``). The bonding driver will +then restore the MAC addresses that the slaves had before they were +enslaved. + +16. Resources and Links +======================= + +The latest version of the bonding driver can be found in the latest +version of the linux kernel, found on http://kernel.org + +The latest version of this document can be found in the latest kernel +source (named Documentation/networking/bonding.rst). + +Discussions regarding the usage of the bonding driver take place on the +bonding-devel mailing list, hosted at sourceforge.net. If you have questions or +problems, post them to the list. The list address is: + +bonding-devel@lists.sourceforge.net + +The administrative interface (to subscribe or unsubscribe) can +be found at: + +https://lists.sourceforge.net/lists/listinfo/bonding-devel + +Discussions regarding the development of the bonding driver take place +on the main Linux network mailing list, hosted at vger.kernel.org. The list +address is: + +netdev@vger.kernel.org + +The administrative interface (to subscribe or unsubscribe) can +be found at: + +http://vger.kernel.org/vger-lists.html#netdev + +Donald Becker's Ethernet Drivers and diag programs may be found at : + + - http://web.archive.org/web/%2E/http://www.scyld.com/network/ + +You will also find a lot of information regarding Ethernet, NWay, MII, +etc. at www.scyld.com. diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt deleted file mode 100644 index e3abfbd32f71..000000000000 --- a/Documentation/networking/bonding.txt +++ /dev/null @@ -1,2837 +0,0 @@ - - Linux Ethernet Bonding Driver HOWTO - - Latest update: 27 April 2011 - -Initial release : Thomas Davis -Corrections, HA extensions : 2000/10/03-15 : - - Willy Tarreau - - Constantine Gavrilov - - Chad N. Tindel - - Janice Girouard - - Jay Vosburgh - -Reorganized and updated Feb 2005 by Jay Vosburgh -Added Sysfs information: 2006/04/24 - - Mitch Williams - -Introduction -============ - - The Linux bonding driver provides a method for aggregating -multiple network interfaces into a single logical "bonded" interface. -The behavior of the bonded interfaces depends upon the mode; generally -speaking, modes provide either hot standby or load balancing services. -Additionally, link integrity monitoring may be performed. - - The bonding driver originally came from Donald Becker's -beowulf patches for kernel 2.0. It has changed quite a bit since, and -the original tools from extreme-linux and beowulf sites will not work -with this version of the driver. - - For new versions of the driver, updated userspace tools, and -who to ask for help, please follow the links at the end of this file. - -Table of Contents -================= - -1. Bonding Driver Installation - -2. Bonding Driver Options - -3. Configuring Bonding Devices -3.1 Configuration with Sysconfig Support -3.1.1 Using DHCP with Sysconfig -3.1.2 Configuring Multiple Bonds with Sysconfig -3.2 Configuration with Initscripts Support -3.2.1 Using DHCP with Initscripts -3.2.2 Configuring Multiple Bonds with Initscripts -3.3 Configuring Bonding Manually with Ifenslave -3.3.1 Configuring Multiple Bonds Manually -3.4 Configuring Bonding Manually via Sysfs -3.5 Configuration with Interfaces Support -3.6 Overriding Configuration for Special Cases -3.7 Configuring LACP for 802.3ad mode in a more secure way - -4. Querying Bonding Configuration -4.1 Bonding Configuration -4.2 Network Configuration - -5. Switch Configuration - -6. 802.1q VLAN Support - -7. Link Monitoring -7.1 ARP Monitor Operation -7.2 Configuring Multiple ARP Targets -7.3 MII Monitor Operation - -8. Potential Trouble Sources -8.1 Adventures in Routing -8.2 Ethernet Device Renaming -8.3 Painfully Slow Or No Failed Link Detection By Miimon - -9. SNMP agents - -10. Promiscuous mode - -11. Configuring Bonding for High Availability -11.1 High Availability in a Single Switch Topology -11.2 High Availability in a Multiple Switch Topology -11.2.1 HA Bonding Mode Selection for Multiple Switch Topology -11.2.2 HA Link Monitoring for Multiple Switch Topology - -12. Configuring Bonding for Maximum Throughput -12.1 Maximum Throughput in a Single Switch Topology -12.1.1 MT Bonding Mode Selection for Single Switch Topology -12.1.2 MT Link Monitoring for Single Switch Topology -12.2 Maximum Throughput in a Multiple Switch Topology -12.2.1 MT Bonding Mode Selection for Multiple Switch Topology -12.2.2 MT Link Monitoring for Multiple Switch Topology - -13. Switch Behavior Issues -13.1 Link Establishment and Failover Delays -13.2 Duplicated Incoming Packets - -14. Hardware Specific Considerations -14.1 IBM BladeCenter - -15. Frequently Asked Questions - -16. Resources and Links - - -1. Bonding Driver Installation -============================== - - Most popular distro kernels ship with the bonding driver -already available as a module. If your distro does not, or you -have need to compile bonding from source (e.g., configuring and -installing a mainline kernel from kernel.org), you'll need to perform -the following steps: - -1.1 Configure and build the kernel with bonding ------------------------------------------------ - - The current version of the bonding driver is available in the -drivers/net/bonding subdirectory of the most recent kernel source -(which is available on http://kernel.org). Most users "rolling their -own" will want to use the most recent kernel from kernel.org. - - Configure kernel with "make menuconfig" (or "make xconfig" or -"make config"), then select "Bonding driver support" in the "Network -device support" section. It is recommended that you configure the -driver as module since it is currently the only way to pass parameters -to the driver or configure more than one bonding device. - - Build and install the new kernel and modules. - -1.2 Bonding Control Utility -------------------------------------- - - It is recommended to configure bonding via iproute2 (netlink) -or sysfs, the old ifenslave control utility is obsolete. - -2. Bonding Driver Options -========================= - - Options for the bonding driver are supplied as parameters to the -bonding module at load time, or are specified via sysfs. - - Module options may be given as command line arguments to the -insmod or modprobe command, but are usually specified in either the -/etc/modprobe.d/*.conf configuration files, or in a distro-specific -configuration file (some of which are detailed in the next section). - - Details on bonding support for sysfs is provided in the -"Configuring Bonding Manually via Sysfs" section, below. - - The available bonding driver parameters are listed below. If a -parameter is not specified the default value is used. When initially -configuring a bond, it is recommended "tail -f /var/log/messages" be -run in a separate window to watch for bonding driver error messages. - - It is critical that either the miimon or arp_interval and -arp_ip_target parameters be specified, otherwise serious network -degradation will occur during link failures. Very few devices do not -support at least miimon, so there is really no reason not to use it. - - Options with textual values will accept either the text name -or, for backwards compatibility, the option value. E.g., -"mode=802.3ad" and "mode=4" set the same mode. - - The parameters are as follows: - -active_slave - - Specifies the new active slave for modes that support it - (active-backup, balance-alb and balance-tlb). Possible values - are the name of any currently enslaved interface, or an empty - string. If a name is given, the slave and its link must be up in order - to be selected as the new active slave. If an empty string is - specified, the current active slave is cleared, and a new active - slave is selected automatically. - - Note that this is only available through the sysfs interface. No module - parameter by this name exists. - - The normal value of this option is the name of the currently - active slave, or the empty string if there is no active slave or - the current mode does not use an active slave. - -ad_actor_sys_prio - - In an AD system, this specifies the system priority. The allowed range - is 1 - 65535. If the value is not specified, it takes 65535 as the - default value. - - This parameter has effect only in 802.3ad mode and is available through - SysFs interface. - -ad_actor_system - - In an AD system, this specifies the mac-address for the actor in - protocol packet exchanges (LACPDUs). The value cannot be NULL or - multicast. It is preferred to have the local-admin bit set for this - mac but driver does not enforce it. If the value is not given then - system defaults to using the masters' mac address as actors' system - address. - - This parameter has effect only in 802.3ad mode and is available through - SysFs interface. - -ad_select - - Specifies the 802.3ad aggregation selection logic to use. The - possible values and their effects are: - - stable or 0 - - The active aggregator is chosen by largest aggregate - bandwidth. - - Reselection of the active aggregator occurs only when all - slaves of the active aggregator are down or the active - aggregator has no slaves. - - This is the default value. - - bandwidth or 1 - - The active aggregator is chosen by largest aggregate - bandwidth. Reselection occurs if: - - - A slave is added to or removed from the bond - - - Any slave's link state changes - - - Any slave's 802.3ad association state changes - - - The bond's administrative state changes to up - - count or 2 - - The active aggregator is chosen by the largest number of - ports (slaves). Reselection occurs as described under the - "bandwidth" setting, above. - - The bandwidth and count selection policies permit failover of - 802.3ad aggregations when partial failure of the active aggregator - occurs. This keeps the aggregator with the highest availability - (either in bandwidth or in number of ports) active at all times. - - This option was added in bonding version 3.4.0. - -ad_user_port_key - - In an AD system, the port-key has three parts as shown below - - - Bits Use - 00 Duplex - 01-05 Speed - 06-15 User-defined - - This defines the upper 10 bits of the port key. The values can be - from 0 - 1023. If not given, the system defaults to 0. - - This parameter has effect only in 802.3ad mode and is available through - SysFs interface. - -all_slaves_active - - Specifies that duplicate frames (received on inactive ports) should be - dropped (0) or delivered (1). - - Normally, bonding will drop duplicate frames (received on inactive - ports), which is desirable for most users. But there are some times - it is nice to allow duplicate frames to be delivered. - - The default value is 0 (drop duplicate frames received on inactive - ports). - -arp_interval - - Specifies the ARP link monitoring frequency in milliseconds. - - The ARP monitor works by periodically checking the slave - devices to determine whether they have sent or received - traffic recently (the precise criteria depends upon the - bonding mode, and the state of the slave). Regular traffic is - generated via ARP probes issued for the addresses specified by - the arp_ip_target option. - - This behavior can be modified by the arp_validate option, - below. - - If ARP monitoring is used in an etherchannel compatible mode - (modes 0 and 2), the switch should be configured in a mode - that evenly distributes packets across all links. If the - switch is configured to distribute the packets in an XOR - fashion, all replies from the ARP targets will be received on - the same link which could cause the other team members to - fail. ARP monitoring should not be used in conjunction with - miimon. A value of 0 disables ARP monitoring. The default - value is 0. - -arp_ip_target - - Specifies the IP addresses to use as ARP monitoring peers when - arp_interval is > 0. These are the targets of the ARP request - sent to determine the health of the link to the targets. - Specify these values in ddd.ddd.ddd.ddd format. Multiple IP - addresses must be separated by a comma. At least one IP - address must be given for ARP monitoring to function. The - maximum number of targets that can be specified is 16. The - default value is no IP addresses. - -arp_validate - - Specifies whether or not ARP probes and replies should be - validated in any mode that supports arp monitoring, or whether - non-ARP traffic should be filtered (disregarded) for link - monitoring purposes. - - Possible values are: - - none or 0 - - No validation or filtering is performed. - - active or 1 - - Validation is performed only for the active slave. - - backup or 2 - - Validation is performed only for backup slaves. - - all or 3 - - Validation is performed for all slaves. - - filter or 4 - - Filtering is applied to all slaves. No validation is - performed. - - filter_active or 5 - - Filtering is applied to all slaves, validation is performed - only for the active slave. - - filter_backup or 6 - - Filtering is applied to all slaves, validation is performed - only for backup slaves. - - Validation: - - Enabling validation causes the ARP monitor to examine the incoming - ARP requests and replies, and only consider a slave to be up if it - is receiving the appropriate ARP traffic. - - For an active slave, the validation checks ARP replies to confirm - that they were generated by an arp_ip_target. Since backup slaves - do not typically receive these replies, the validation performed - for backup slaves is on the broadcast ARP request sent out via the - active slave. It is possible that some switch or network - configurations may result in situations wherein the backup slaves - do not receive the ARP requests; in such a situation, validation - of backup slaves must be disabled. - - The validation of ARP requests on backup slaves is mainly helping - bonding to decide which slaves are more likely to work in case of - the active slave failure, it doesn't really guarantee that the - backup slave will work if it's selected as the next active slave. - - Validation is useful in network configurations in which multiple - bonding hosts are concurrently issuing ARPs to one or more targets - beyond a common switch. Should the link between the switch and - target fail (but not the switch itself), the probe traffic - generated by the multiple bonding instances will fool the standard - ARP monitor into considering the links as still up. Use of - validation can resolve this, as the ARP monitor will only consider - ARP requests and replies associated with its own instance of - bonding. - - Filtering: - - Enabling filtering causes the ARP monitor to only use incoming ARP - packets for link availability purposes. Arriving packets that are - not ARPs are delivered normally, but do not count when determining - if a slave is available. - - Filtering operates by only considering the reception of ARP - packets (any ARP packet, regardless of source or destination) when - determining if a slave has received traffic for link availability - purposes. - - Filtering is useful in network configurations in which significant - levels of third party broadcast traffic would fool the standard - ARP monitor into considering the links as still up. Use of - filtering can resolve this, as only ARP traffic is considered for - link availability purposes. - - This option was added in bonding version 3.1.0. - -arp_all_targets - - Specifies the quantity of arp_ip_targets that must be reachable - in order for the ARP monitor to consider a slave as being up. - This option affects only active-backup mode for slaves with - arp_validation enabled. - - Possible values are: - - any or 0 - - consider the slave up only when any of the arp_ip_targets - is reachable - - all or 1 - - consider the slave up only when all of the arp_ip_targets - are reachable - -downdelay - - Specifies the time, in milliseconds, to wait before disabling - a slave after a link failure has been detected. This option - is only valid for the miimon link monitor. The downdelay - value should be a multiple of the miimon value; if not, it - will be rounded down to the nearest multiple. The default - value is 0. - -fail_over_mac - - Specifies whether active-backup mode should set all slaves to - the same MAC address at enslavement (the traditional - behavior), or, when enabled, perform special handling of the - bond's MAC address in accordance with the selected policy. - - Possible values are: - - none or 0 - - This setting disables fail_over_mac, and causes - bonding to set all slaves of an active-backup bond to - the same MAC address at enslavement time. This is the - default. - - active or 1 - - The "active" fail_over_mac policy indicates that the - MAC address of the bond should always be the MAC - address of the currently active slave. The MAC - address of the slaves is not changed; instead, the MAC - address of the bond changes during a failover. - - This policy is useful for devices that cannot ever - alter their MAC address, or for devices that refuse - incoming broadcasts with their own source MAC (which - interferes with the ARP monitor). - - The down side of this policy is that every device on - the network must be updated via gratuitous ARP, - vs. just updating a switch or set of switches (which - often takes place for any traffic, not just ARP - traffic, if the switch snoops incoming traffic to - update its tables) for the traditional method. If the - gratuitous ARP is lost, communication may be - disrupted. - - When this policy is used in conjunction with the mii - monitor, devices which assert link up prior to being - able to actually transmit and receive are particularly - susceptible to loss of the gratuitous ARP, and an - appropriate updelay setting may be required. - - follow or 2 - - The "follow" fail_over_mac policy causes the MAC - address of the bond to be selected normally (normally - the MAC address of the first slave added to the bond). - However, the second and subsequent slaves are not set - to this MAC address while they are in a backup role; a - slave is programmed with the bond's MAC address at - failover time (and the formerly active slave receives - the newly active slave's MAC address). - - This policy is useful for multiport devices that - either become confused or incur a performance penalty - when multiple ports are programmed with the same MAC - address. - - - The default policy is none, unless the first slave cannot - change its MAC address, in which case the active policy is - selected by default. - - This option may be modified via sysfs only when no slaves are - present in the bond. - - This option was added in bonding version 3.2.0. The "follow" - policy was added in bonding version 3.3.0. - -lacp_rate - - Option specifying the rate in which we'll ask our link partner - to transmit LACPDU packets in 802.3ad mode. Possible values - are: - - slow or 0 - Request partner to transmit LACPDUs every 30 seconds - - fast or 1 - Request partner to transmit LACPDUs every 1 second - - The default is slow. - -max_bonds - - Specifies the number of bonding devices to create for this - instance of the bonding driver. E.g., if max_bonds is 3, and - the bonding driver is not already loaded, then bond0, bond1 - and bond2 will be created. The default value is 1. Specifying - a value of 0 will load bonding, but will not create any devices. - -miimon - - Specifies the MII link monitoring frequency in milliseconds. - This determines how often the link state of each slave is - inspected for link failures. A value of zero disables MII - link monitoring. A value of 100 is a good starting point. - The use_carrier option, below, affects how the link state is - determined. See the High Availability section for additional - information. The default value is 0. - -min_links - - Specifies the minimum number of links that must be active before - asserting carrier. It is similar to the Cisco EtherChannel min-links - feature. This allows setting the minimum number of member ports that - must be up (link-up state) before marking the bond device as up - (carrier on). This is useful for situations where higher level services - such as clustering want to ensure a minimum number of low bandwidth - links are active before switchover. This option only affect 802.3ad - mode. - - The default value is 0. This will cause carrier to be asserted (for - 802.3ad mode) whenever there is an active aggregator, regardless of the - number of available links in that aggregator. Note that, because an - aggregator cannot be active without at least one available link, - setting this option to 0 or to 1 has the exact same effect. - -mode - - Specifies one of the bonding policies. The default is - balance-rr (round robin). Possible values are: - - balance-rr or 0 - - Round-robin policy: Transmit packets in sequential - order from the first available slave through the - last. This mode provides load balancing and fault - tolerance. - - active-backup or 1 - - Active-backup policy: Only one slave in the bond is - active. A different slave becomes active if, and only - if, the active slave fails. The bond's MAC address is - externally visible on only one port (network adapter) - to avoid confusing the switch. - - In bonding version 2.6.2 or later, when a failover - occurs in active-backup mode, bonding will issue one - or more gratuitous ARPs on the newly active slave. - One gratuitous ARP is issued for the bonding master - interface and each VLAN interfaces configured above - it, provided that the interface has at least one IP - address configured. Gratuitous ARPs issued for VLAN - interfaces are tagged with the appropriate VLAN id. - - This mode provides fault tolerance. The primary - option, documented below, affects the behavior of this - mode. - - balance-xor or 2 - - XOR policy: Transmit based on the selected transmit - hash policy. The default policy is a simple [(source - MAC address XOR'd with destination MAC address XOR - packet type ID) modulo slave count]. Alternate transmit - policies may be selected via the xmit_hash_policy option, - described below. - - This mode provides load balancing and fault tolerance. - - broadcast or 3 - - Broadcast policy: transmits everything on all slave - interfaces. This mode provides fault tolerance. - - 802.3ad or 4 - - IEEE 802.3ad Dynamic link aggregation. Creates - aggregation groups that share the same speed and - duplex settings. Utilizes all slaves in the active - aggregator according to the 802.3ad specification. - - Slave selection for outgoing traffic is done according - to the transmit hash policy, which may be changed from - the default simple XOR policy via the xmit_hash_policy - option, documented below. Note that not all transmit - policies may be 802.3ad compliant, particularly in - regards to the packet mis-ordering requirements of - section 43.2.4 of the 802.3ad standard. Differing - peer implementations will have varying tolerances for - noncompliance. - - Prerequisites: - - 1. Ethtool support in the base drivers for retrieving - the speed and duplex of each slave. - - 2. A switch that supports IEEE 802.3ad Dynamic link - aggregation. - - Most switches will require some type of configuration - to enable 802.3ad mode. - - balance-tlb or 5 - - Adaptive transmit load balancing: channel bonding that - does not require any special switch support. - - In tlb_dynamic_lb=1 mode; the outgoing traffic is - distributed according to the current load (computed - relative to the speed) on each slave. - - In tlb_dynamic_lb=0 mode; the load balancing based on - current load is disabled and the load is distributed - only using the hash distribution. - - Incoming traffic is received by the current slave. - If the receiving slave fails, another slave takes over - the MAC address of the failed receiving slave. - - Prerequisite: - - Ethtool support in the base drivers for retrieving the - speed of each slave. - - balance-alb or 6 - - Adaptive load balancing: includes balance-tlb plus - receive load balancing (rlb) for IPV4 traffic, and - does not require any special switch support. The - receive load balancing is achieved by ARP negotiation. - The bonding driver intercepts the ARP Replies sent by - the local system on their way out and overwrites the - source hardware address with the unique hardware - address of one of the slaves in the bond such that - different peers use different hardware addresses for - the server. - - Receive traffic from connections created by the server - is also balanced. When the local system sends an ARP - Request the bonding driver copies and saves the peer's - IP information from the ARP packet. When the ARP - Reply arrives from the peer, its hardware address is - retrieved and the bonding driver initiates an ARP - reply to this peer assigning it to one of the slaves - in the bond. A problematic outcome of using ARP - negotiation for balancing is that each time that an - ARP request is broadcast it uses the hardware address - of the bond. Hence, peers learn the hardware address - of the bond and the balancing of receive traffic - collapses to the current slave. This is handled by - sending updates (ARP Replies) to all the peers with - their individually assigned hardware address such that - the traffic is redistributed. Receive traffic is also - redistributed when a new slave is added to the bond - and when an inactive slave is re-activated. The - receive load is distributed sequentially (round robin) - among the group of highest speed slaves in the bond. - - When a link is reconnected or a new slave joins the - bond the receive traffic is redistributed among all - active slaves in the bond by initiating ARP Replies - with the selected MAC address to each of the - clients. The updelay parameter (detailed below) must - be set to a value equal or greater than the switch's - forwarding delay so that the ARP Replies sent to the - peers will not be blocked by the switch. - - Prerequisites: - - 1. Ethtool support in the base drivers for retrieving - the speed of each slave. - - 2. Base driver support for setting the hardware - address of a device while it is open. This is - required so that there will always be one slave in the - team using the bond hardware address (the - curr_active_slave) while having a unique hardware - address for each slave in the bond. If the - curr_active_slave fails its hardware address is - swapped with the new curr_active_slave that was - chosen. - -num_grat_arp -num_unsol_na - - Specify the number of peer notifications (gratuitous ARPs and - unsolicited IPv6 Neighbor Advertisements) to be issued after a - failover event. As soon as the link is up on the new slave - (possibly immediately) a peer notification is sent on the - bonding device and each VLAN sub-device. This is repeated at - the rate specified by peer_notif_delay if the number is - greater than 1. - - The valid range is 0 - 255; the default value is 1. These options - affect only the active-backup mode. These options were added for - bonding versions 3.3.0 and 3.4.0 respectively. - - From Linux 3.0 and bonding version 3.7.1, these notifications - are generated by the ipv4 and ipv6 code and the numbers of - repetitions cannot be set independently. - -packets_per_slave - - Specify the number of packets to transmit through a slave before - moving to the next one. When set to 0 then a slave is chosen at - random. - - The valid range is 0 - 65535; the default value is 1. This option - has effect only in balance-rr mode. - -peer_notif_delay - - Specify the delay, in milliseconds, between each peer - notification (gratuitous ARP and unsolicited IPv6 Neighbor - Advertisement) when they are issued after a failover event. - This delay should be a multiple of the link monitor interval - (arp_interval or miimon, whichever is active). The default - value is 0 which means to match the value of the link monitor - interval. - -primary - - A string (eth0, eth2, etc) specifying which slave is the - primary device. The specified device will always be the - active slave while it is available. Only when the primary is - off-line will alternate devices be used. This is useful when - one slave is preferred over another, e.g., when one slave has - higher throughput than another. - - The primary option is only valid for active-backup(1), - balance-tlb (5) and balance-alb (6) mode. - -primary_reselect - - Specifies the reselection policy for the primary slave. This - affects how the primary slave is chosen to become the active slave - when failure of the active slave or recovery of the primary slave - occurs. This option is designed to prevent flip-flopping between - the primary slave and other slaves. Possible values are: - - always or 0 (default) - - The primary slave becomes the active slave whenever it - comes back up. - - better or 1 - - The primary slave becomes the active slave when it comes - back up, if the speed and duplex of the primary slave is - better than the speed and duplex of the current active - slave. - - failure or 2 - - The primary slave becomes the active slave only if the - current active slave fails and the primary slave is up. - - The primary_reselect setting is ignored in two cases: - - If no slaves are active, the first slave to recover is - made the active slave. - - When initially enslaved, the primary slave is always made - the active slave. - - Changing the primary_reselect policy via sysfs will cause an - immediate selection of the best active slave according to the new - policy. This may or may not result in a change of the active - slave, depending upon the circumstances. - - This option was added for bonding version 3.6.0. - -tlb_dynamic_lb - - Specifies if dynamic shuffling of flows is enabled in tlb - mode. The value has no effect on any other modes. - - The default behavior of tlb mode is to shuffle active flows across - slaves based on the load in that interval. This gives nice lb - characteristics but can cause packet reordering. If re-ordering is - a concern use this variable to disable flow shuffling and rely on - load balancing provided solely by the hash distribution. - xmit-hash-policy can be used to select the appropriate hashing for - the setup. - - The sysfs entry can be used to change the setting per bond device - and the initial value is derived from the module parameter. The - sysfs entry is allowed to be changed only if the bond device is - down. - - The default value is "1" that enables flow shuffling while value "0" - disables it. This option was added in bonding driver 3.7.1 - - -updelay - - Specifies the time, in milliseconds, to wait before enabling a - slave after a link recovery has been detected. This option is - only valid for the miimon link monitor. The updelay value - should be a multiple of the miimon value; if not, it will be - rounded down to the nearest multiple. The default value is 0. - -use_carrier - - Specifies whether or not miimon should use MII or ETHTOOL - ioctls vs. netif_carrier_ok() to determine the link - status. The MII or ETHTOOL ioctls are less efficient and - utilize a deprecated calling sequence within the kernel. The - netif_carrier_ok() relies on the device driver to maintain its - state with netif_carrier_on/off; at this writing, most, but - not all, device drivers support this facility. - - If bonding insists that the link is up when it should not be, - it may be that your network device driver does not support - netif_carrier_on/off. The default state for netif_carrier is - "carrier on," so if a driver does not support netif_carrier, - it will appear as if the link is always up. In this case, - setting use_carrier to 0 will cause bonding to revert to the - MII / ETHTOOL ioctl method to determine the link state. - - A value of 1 enables the use of netif_carrier_ok(), a value of - 0 will use the deprecated MII / ETHTOOL ioctls. The default - value is 1. - -xmit_hash_policy - - Selects the transmit hash policy to use for slave selection in - balance-xor, 802.3ad, and tlb modes. Possible values are: - - layer2 - - Uses XOR of hardware MAC addresses and packet type ID - field to generate the hash. The formula is - - hash = source MAC XOR destination MAC XOR packet type ID - slave number = hash modulo slave count - - This algorithm will place all traffic to a particular - network peer on the same slave. - - This algorithm is 802.3ad compliant. - - layer2+3 - - This policy uses a combination of layer2 and layer3 - protocol information to generate the hash. - - Uses XOR of hardware MAC addresses and IP addresses to - generate the hash. The formula is - - hash = source MAC XOR destination MAC XOR packet type ID - hash = hash XOR source IP XOR destination IP - hash = hash XOR (hash RSHIFT 16) - hash = hash XOR (hash RSHIFT 8) - And then hash is reduced modulo slave count. - - If the protocol is IPv6 then the source and destination - addresses are first hashed using ipv6_addr_hash. - - This algorithm will place all traffic to a particular - network peer on the same slave. For non-IP traffic, - the formula is the same as for the layer2 transmit - hash policy. - - This policy is intended to provide a more balanced - distribution of traffic than layer2 alone, especially - in environments where a layer3 gateway device is - required to reach most destinations. - - This algorithm is 802.3ad compliant. - - layer3+4 - - This policy uses upper layer protocol information, - when available, to generate the hash. This allows for - traffic to a particular network peer to span multiple - slaves, although a single connection will not span - multiple slaves. - - The formula for unfragmented TCP and UDP packets is - - hash = source port, destination port (as in the header) - hash = hash XOR source IP XOR destination IP - hash = hash XOR (hash RSHIFT 16) - hash = hash XOR (hash RSHIFT 8) - And then hash is reduced modulo slave count. - - If the protocol is IPv6 then the source and destination - addresses are first hashed using ipv6_addr_hash. - - For fragmented TCP or UDP packets and all other IPv4 and - IPv6 protocol traffic, the source and destination port - information is omitted. For non-IP traffic, the - formula is the same as for the layer2 transmit hash - policy. - - This algorithm is not fully 802.3ad compliant. A - single TCP or UDP conversation containing both - fragmented and unfragmented packets will see packets - striped across two interfaces. This may result in out - of order delivery. Most traffic types will not meet - this criteria, as TCP rarely fragments traffic, and - most UDP traffic is not involved in extended - conversations. Other implementations of 802.3ad may - or may not tolerate this noncompliance. - - encap2+3 - - This policy uses the same formula as layer2+3 but it - relies on skb_flow_dissect to obtain the header fields - which might result in the use of inner headers if an - encapsulation protocol is used. For example this will - improve the performance for tunnel users because the - packets will be distributed according to the encapsulated - flows. - - encap3+4 - - This policy uses the same formula as layer3+4 but it - relies on skb_flow_dissect to obtain the header fields - which might result in the use of inner headers if an - encapsulation protocol is used. For example this will - improve the performance for tunnel users because the - packets will be distributed according to the encapsulated - flows. - - The default value is layer2. This option was added in bonding - version 2.6.3. In earlier versions of bonding, this parameter - does not exist, and the layer2 policy is the only policy. The - layer2+3 value was added for bonding version 3.2.2. - -resend_igmp - - Specifies the number of IGMP membership reports to be issued after - a failover event. One membership report is issued immediately after - the failover, subsequent packets are sent in each 200ms interval. - - The valid range is 0 - 255; the default value is 1. A value of 0 - prevents the IGMP membership report from being issued in response - to the failover event. - - This option is useful for bonding modes balance-rr (0), active-backup - (1), balance-tlb (5) and balance-alb (6), in which a failover can - switch the IGMP traffic from one slave to another. Therefore a fresh - IGMP report must be issued to cause the switch to forward the incoming - IGMP traffic over the newly selected slave. - - This option was added for bonding version 3.7.0. - -lp_interval - - Specifies the number of seconds between instances where the bonding - driver sends learning packets to each slaves peer switch. - - The valid range is 1 - 0x7fffffff; the default value is 1. This Option - has effect only in balance-tlb and balance-alb modes. - -3. Configuring Bonding Devices -============================== - - You can configure bonding using either your distro's network -initialization scripts, or manually using either iproute2 or the -sysfs interface. Distros generally use one of three packages for the -network initialization scripts: initscripts, sysconfig or interfaces. -Recent versions of these packages have support for bonding, while older -versions do not. - - We will first describe the options for configuring bonding for -distros using versions of initscripts, sysconfig and interfaces with full -or partial support for bonding, then provide information on enabling -bonding without support from the network initialization scripts (i.e., -older versions of initscripts or sysconfig). - - If you're unsure whether your distro uses sysconfig, -initscripts or interfaces, or don't know if it's new enough, have no fear. -Determining this is fairly straightforward. - - First, look for a file called interfaces in /etc/network directory. -If this file is present in your system, then your system use interfaces. See -Configuration with Interfaces Support. - - Else, issue the command: - -$ rpm -qf /sbin/ifup - - It will respond with a line of text starting with either -"initscripts" or "sysconfig," followed by some numbers. This is the -package that provides your network initialization scripts. - - Next, to determine if your installation supports bonding, -issue the command: - -$ grep ifenslave /sbin/ifup - - If this returns any matches, then your initscripts or -sysconfig has support for bonding. - -3.1 Configuration with Sysconfig Support ----------------------------------------- - - This section applies to distros using a version of sysconfig -with bonding support, for example, SuSE Linux Enterprise Server 9. - - SuSE SLES 9's networking configuration system does support -bonding, however, at this writing, the YaST system configuration -front end does not provide any means to work with bonding devices. -Bonding devices can be managed by hand, however, as follows. - - First, if they have not already been configured, configure the -slave devices. On SLES 9, this is most easily done by running the -yast2 sysconfig configuration utility. The goal is for to create an -ifcfg-id file for each slave device. The simplest way to accomplish -this is to configure the devices for DHCP (this is only to get the -file ifcfg-id file created; see below for some issues with DHCP). The -name of the configuration file for each device will be of the form: - -ifcfg-id-xx:xx:xx:xx:xx:xx - - Where the "xx" portion will be replaced with the digits from -the device's permanent MAC address. - - Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been -created, it is necessary to edit the configuration files for the slave -devices (the MAC addresses correspond to those of the slave devices). -Before editing, the file will contain multiple lines, and will look -something like this: - -BOOTPROTO='dhcp' -STARTMODE='on' -USERCTL='no' -UNIQUE='XNzu.WeZGOGF+4wE' -_nm_name='bus-pci-0001:61:01.0' - - Change the BOOTPROTO and STARTMODE lines to the following: - -BOOTPROTO='none' -STARTMODE='off' - - Do not alter the UNIQUE or _nm_name lines. Remove any other -lines (USERCTL, etc). - - Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified, -it's time to create the configuration file for the bonding device -itself. This file is named ifcfg-bondX, where X is the number of the -bonding device to create, starting at 0. The first such file is -ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig -network configuration system will correctly start multiple instances -of bonding. - - The contents of the ifcfg-bondX file is as follows: - -BOOTPROTO="static" -BROADCAST="10.0.2.255" -IPADDR="10.0.2.10" -NETMASK="255.255.0.0" -NETWORK="10.0.2.0" -REMOTE_IPADDR="" -STARTMODE="onboot" -BONDING_MASTER="yes" -BONDING_MODULE_OPTS="mode=active-backup miimon=100" -BONDING_SLAVE0="eth0" -BONDING_SLAVE1="bus-pci-0000:06:08.1" - - Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK -values with the appropriate values for your network. - - The STARTMODE specifies when the device is brought online. -The possible values are: - - onboot: The device is started at boot time. If you're not - sure, this is probably what you want. - - manual: The device is started only when ifup is called - manually. Bonding devices may be configured this - way if you do not wish them to start automatically - at boot for some reason. - - hotplug: The device is started by a hotplug event. This is not - a valid choice for a bonding device. - - off or ignore: The device configuration is ignored. - - The line BONDING_MASTER='yes' indicates that the device is a -bonding master device. The only useful value is "yes." - - The contents of BONDING_MODULE_OPTS are supplied to the -instance of the bonding module for this device. Specify the options -for the bonding mode, link monitoring, and so on here. Do not include -the max_bonds bonding parameter; this will confuse the configuration -system if you have multiple bonding devices. - - Finally, supply one BONDING_SLAVEn="slave device" for each -slave. where "n" is an increasing value, one for each slave. The -"slave device" is either an interface name, e.g., "eth0", or a device -specifier for the network device. The interface name is easier to -find, but the ethN names are subject to change at boot time if, e.g., -a device early in the sequence has failed. The device specifiers -(bus-pci-0000:06:08.1 in the example above) specify the physical -network device, and will not change unless the device's bus location -changes (for example, it is moved from one PCI slot to another). The -example above uses one of each type for demonstration purposes; most -configurations will choose one or the other for all slave devices. - - When all configuration files have been modified or created, -networking must be restarted for the configuration changes to take -effect. This can be accomplished via the following: - -# /etc/init.d/network restart - - Note that the network control script (/sbin/ifdown) will -remove the bonding module as part of the network shutdown processing, -so it is not necessary to remove the module by hand if, e.g., the -module parameters have changed. - - Also, at this writing, YaST/YaST2 will not manage bonding -devices (they do not show bonding interfaces on its list of network -devices). It is necessary to edit the configuration file by hand to -change the bonding configuration. - - Additional general options and details of the ifcfg file -format can be found in an example ifcfg template file: - -/etc/sysconfig/network/ifcfg.template - - Note that the template does not document the various BONDING_ -settings described above, but does describe many of the other options. - -3.1.1 Using DHCP with Sysconfig -------------------------------- - - Under sysconfig, configuring a device with BOOTPROTO='dhcp' -will cause it to query DHCP for its IP address information. At this -writing, this does not function for bonding devices; the scripts -attempt to obtain the device address from DHCP prior to adding any of -the slave devices. Without active slaves, the DHCP requests are not -sent to the network. - -3.1.2 Configuring Multiple Bonds with Sysconfig ------------------------------------------------ - - The sysconfig network initialization system is capable of -handling multiple bonding devices. All that is necessary is for each -bonding instance to have an appropriately configured ifcfg-bondX file -(as described above). Do not specify the "max_bonds" parameter to any -instance of bonding, as this will confuse sysconfig. If you require -multiple bonding devices with identical parameters, create multiple -ifcfg-bondX files. - - Because the sysconfig scripts supply the bonding module -options in the ifcfg-bondX file, it is not necessary to add them to -the system /etc/modules.d/*.conf configuration files. - -3.2 Configuration with Initscripts Support ------------------------------------------- - - This section applies to distros using a recent version of -initscripts with bonding support, for example, Red Hat Enterprise Linux -version 3 or later, Fedora, etc. On these systems, the network -initialization scripts have knowledge of bonding, and can be configured to -control bonding devices. Note that older versions of the initscripts -package have lower levels of support for bonding; this will be noted where -applicable. - - These distros will not automatically load the network adapter -driver unless the ethX device is configured with an IP address. -Because of this constraint, users must manually configure a -network-script file for all physical adapters that will be members of -a bondX link. Network script files are located in the directory: - -/etc/sysconfig/network-scripts - - The file name must be prefixed with "ifcfg-eth" and suffixed -with the adapter's physical adapter number. For example, the script -for eth0 would be named /etc/sysconfig/network-scripts/ifcfg-eth0. -Place the following text in the file: - -DEVICE=eth0 -USERCTL=no -ONBOOT=yes -MASTER=bond0 -SLAVE=yes -BOOTPROTO=none - - The DEVICE= line will be different for every ethX device and -must correspond with the name of the file, i.e., ifcfg-eth1 must have -a device line of DEVICE=eth1. The setting of the MASTER= line will -also depend on the final bonding interface name chosen for your bond. -As with other network devices, these typically start at 0, and go up -one for each device, i.e., the first bonding instance is bond0, the -second is bond1, and so on. - - Next, create a bond network script. The file name for this -script will be /etc/sysconfig/network-scripts/ifcfg-bondX where X is -the number of the bond. For bond0 the file is named "ifcfg-bond0", -for bond1 it is named "ifcfg-bond1", and so on. Within that file, -place the following text: - -DEVICE=bond0 -IPADDR=192.168.1.1 -NETMASK=255.255.255.0 -NETWORK=192.168.1.0 -BROADCAST=192.168.1.255 -ONBOOT=yes -BOOTPROTO=none -USERCTL=no - - Be sure to change the networking specific lines (IPADDR, -NETMASK, NETWORK and BROADCAST) to match your network configuration. - - For later versions of initscripts, such as that found with Fedora -7 (or later) and Red Hat Enterprise Linux version 5 (or later), it is possible, -and, indeed, preferable, to specify the bonding options in the ifcfg-bond0 -file, e.g. a line of the format: - -BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=192.168.1.254" - - will configure the bond with the specified options. The options -specified in BONDING_OPTS are identical to the bonding module parameters -except for the arp_ip_target field when using versions of initscripts older -than and 8.57 (Fedora 8) and 8.45.19 (Red Hat Enterprise Linux 5.2). When -using older versions each target should be included as a separate option and -should be preceded by a '+' to indicate it should be added to the list of -queried targets, e.g., - - arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2 - - is the proper syntax to specify multiple targets. When specifying -options via BONDING_OPTS, it is not necessary to edit /etc/modprobe.d/*.conf. - - For even older versions of initscripts that do not support -BONDING_OPTS, it is necessary to edit /etc/modprobe.d/*.conf, depending upon -your distro) to load the bonding module with your desired options when the -bond0 interface is brought up. The following lines in /etc/modprobe.d/*.conf -will load the bonding module, and select its options: - -alias bond0 bonding -options bond0 mode=balance-alb miimon=100 - - Replace the sample parameters with the appropriate set of -options for your configuration. - - Finally run "/etc/rc.d/init.d/network restart" as root. This -will restart the networking subsystem and your bond link should be now -up and running. - -3.2.1 Using DHCP with Initscripts ---------------------------------- - - Recent versions of initscripts (the versions supplied with Fedora -Core 3 and Red Hat Enterprise Linux 4, or later versions, are reported to -work) have support for assigning IP information to bonding devices via -DHCP. - - To configure bonding for DHCP, configure it as described -above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp" -and add a line consisting of "TYPE=Bonding". Note that the TYPE value -is case sensitive. - -3.2.2 Configuring Multiple Bonds with Initscripts -------------------------------------------------- - - Initscripts packages that are included with Fedora 7 and Red Hat -Enterprise Linux 5 support multiple bonding interfaces by simply -specifying the appropriate BONDING_OPTS= in ifcfg-bondX where X is the -number of the bond. This support requires sysfs support in the kernel, -and a bonding driver of version 3.0.0 or later. Other configurations may -not support this method for specifying multiple bonding interfaces; for -those instances, see the "Configuring Multiple Bonds Manually" section, -below. - -3.3 Configuring Bonding Manually with iproute2 ------------------------------------------------ - - This section applies to distros whose network initialization -scripts (the sysconfig or initscripts package) do not have specific -knowledge of bonding. One such distro is SuSE Linux Enterprise Server -version 8. - - The general method for these systems is to place the bonding -module parameters into a config file in /etc/modprobe.d/ (as -appropriate for the installed distro), then add modprobe and/or -`ip link` commands to the system's global init script. The name of -the global init script differs; for sysconfig, it is -/etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local. - - For example, if you wanted to make a simple bond of two e100 -devices (presumed to be eth0 and eth1), and have it persist across -reboots, edit the appropriate file (/etc/init.d/boot.local or -/etc/rc.d/rc.local), and add the following: - -modprobe bonding mode=balance-alb miimon=100 -modprobe e100 -ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up -ip link set eth0 master bond0 -ip link set eth1 master bond0 - - Replace the example bonding module parameters and bond0 -network configuration (IP address, netmask, etc) with the appropriate -values for your configuration. - - Unfortunately, this method will not provide support for the -ifup and ifdown scripts on the bond devices. To reload the bonding -configuration, it is necessary to run the initialization script, e.g., - -# /etc/init.d/boot.local - - or - -# /etc/rc.d/rc.local - - It may be desirable in such a case to create a separate script -which only initializes the bonding configuration, then call that -separate script from within boot.local. This allows for bonding to be -enabled without re-running the entire global init script. - - To shut down the bonding devices, it is necessary to first -mark the bonding device itself as being down, then remove the -appropriate device driver modules. For our example above, you can do -the following: - -# ifconfig bond0 down -# rmmod bonding -# rmmod e100 - - Again, for convenience, it may be desirable to create a script -with these commands. - - -3.3.1 Configuring Multiple Bonds Manually ------------------------------------------ - - This section contains information on configuring multiple -bonding devices with differing options for those systems whose network -initialization scripts lack support for configuring multiple bonds. - - If you require multiple bonding devices, but all with the same -options, you may wish to use the "max_bonds" module parameter, -documented above. - - To create multiple bonding devices with differing options, it is -preferable to use bonding parameters exported by sysfs, documented in the -section below. - - For versions of bonding without sysfs support, the only means to -provide multiple instances of bonding with differing options is to load -the bonding driver multiple times. Note that current versions of the -sysconfig network initialization scripts handle this automatically; if -your distro uses these scripts, no special action is needed. See the -section Configuring Bonding Devices, above, if you're not sure about your -network initialization scripts. - - To load multiple instances of the module, it is necessary to -specify a different name for each instance (the module loading system -requires that every loaded module, even multiple instances of the same -module, have a unique name). This is accomplished by supplying multiple -sets of bonding options in /etc/modprobe.d/*.conf, for example: - -alias bond0 bonding -options bond0 -o bond0 mode=balance-rr miimon=100 - -alias bond1 bonding -options bond1 -o bond1 mode=balance-alb miimon=50 - - will load the bonding module two times. The first instance is -named "bond0" and creates the bond0 device in balance-rr mode with an -miimon of 100. The second instance is named "bond1" and creates the -bond1 device in balance-alb mode with an miimon of 50. - - In some circumstances (typically with older distributions), -the above does not work, and the second bonding instance never sees -its options. In that case, the second options line can be substituted -as follows: - -install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ - mode=balance-alb miimon=50 - - This may be repeated any number of times, specifying a new and -unique name in place of bond1 for each subsequent instance. - - It has been observed that some Red Hat supplied kernels are unable -to rename modules at load time (the "-o bond1" part). Attempts to pass -that option to modprobe will produce an "Operation not permitted" error. -This has been reported on some Fedora Core kernels, and has been seen on -RHEL 4 as well. On kernels exhibiting this problem, it will be impossible -to configure multiple bonds with differing parameters (as they are older -kernels, and also lack sysfs support). - -3.4 Configuring Bonding Manually via Sysfs ------------------------------------------- - - Starting with version 3.0.0, Channel Bonding may be configured -via the sysfs interface. This interface allows dynamic configuration -of all bonds in the system without unloading the module. It also -allows for adding and removing bonds at runtime. Ifenslave is no -longer required, though it is still supported. - - Use of the sysfs interface allows you to use multiple bonds -with different configurations without having to reload the module. -It also allows you to use multiple, differently configured bonds when -bonding is compiled into the kernel. - - You must have the sysfs filesystem mounted to configure -bonding this way. The examples in this document assume that you -are using the standard mount point for sysfs, e.g. /sys. If your -sysfs filesystem is mounted elsewhere, you will need to adjust the -example paths accordingly. - -Creating and Destroying Bonds ------------------------------ -To add a new bond foo: -# echo +foo > /sys/class/net/bonding_masters - -To remove an existing bond bar: -# echo -bar > /sys/class/net/bonding_masters - -To show all existing bonds: -# cat /sys/class/net/bonding_masters - -NOTE: due to 4K size limitation of sysfs files, this list may be -truncated if you have more than a few hundred bonds. This is unlikely -to occur under normal operating conditions. - -Adding and Removing Slaves --------------------------- - Interfaces may be enslaved to a bond using the file -/sys/class/net//bonding/slaves. The semantics for this file -are the same as for the bonding_masters file. - -To enslave interface eth0 to bond bond0: -# ifconfig bond0 up -# echo +eth0 > /sys/class/net/bond0/bonding/slaves - -To free slave eth0 from bond bond0: -# echo -eth0 > /sys/class/net/bond0/bonding/slaves - - When an interface is enslaved to a bond, symlinks between the -two are created in the sysfs filesystem. In this case, you would get -/sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and -/sys/class/net/eth0/master pointing to /sys/class/net/bond0. - - This means that you can tell quickly whether or not an -interface is enslaved by looking for the master symlink. Thus: -# echo -eth0 > /sys/class/net/eth0/master/bonding/slaves -will free eth0 from whatever bond it is enslaved to, regardless of -the name of the bond interface. - -Changing a Bond's Configuration -------------------------------- - Each bond may be configured individually by manipulating the -files located in /sys/class/net//bonding - - The names of these files correspond directly with the command- -line parameters described elsewhere in this file, and, with the -exception of arp_ip_target, they accept the same values. To see the -current setting, simply cat the appropriate file. - - A few examples will be given here; for specific usage -guidelines for each parameter, see the appropriate section in this -document. - -To configure bond0 for balance-alb mode: -# ifconfig bond0 down -# echo 6 > /sys/class/net/bond0/bonding/mode - - or - -# echo balance-alb > /sys/class/net/bond0/bonding/mode - NOTE: The bond interface must be down before the mode can be -changed. - -To enable MII monitoring on bond0 with a 1 second interval: -# echo 1000 > /sys/class/net/bond0/bonding/miimon - NOTE: If ARP monitoring is enabled, it will disabled when MII -monitoring is enabled, and vice-versa. - -To add ARP targets: -# echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target -# echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target - NOTE: up to 16 target addresses may be specified. - -To remove an ARP target: -# echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target - -To configure the interval between learning packet transmits: -# echo 12 > /sys/class/net/bond0/bonding/lp_interval - NOTE: the lp_interval is the number of seconds between instances where -the bonding driver sends learning packets to each slaves peer switch. The -default interval is 1 second. - -Example Configuration ---------------------- - We begin with the same example that is shown in section 3.3, -executed with sysfs, and without using ifenslave. - - To make a simple bond of two e100 devices (presumed to be eth0 -and eth1), and have it persist across reboots, edit the appropriate -file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the -following: - -modprobe bonding -modprobe e100 -echo balance-alb > /sys/class/net/bond0/bonding/mode -ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up -echo 100 > /sys/class/net/bond0/bonding/miimon -echo +eth0 > /sys/class/net/bond0/bonding/slaves -echo +eth1 > /sys/class/net/bond0/bonding/slaves - - To add a second bond, with two e1000 interfaces in -active-backup mode, using ARP monitoring, add the following lines to -your init script: - -modprobe e1000 -echo +bond1 > /sys/class/net/bonding_masters -echo active-backup > /sys/class/net/bond1/bonding/mode -ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up -echo +192.168.2.100 /sys/class/net/bond1/bonding/arp_ip_target -echo 2000 > /sys/class/net/bond1/bonding/arp_interval -echo +eth2 > /sys/class/net/bond1/bonding/slaves -echo +eth3 > /sys/class/net/bond1/bonding/slaves - -3.5 Configuration with Interfaces Support ------------------------------------------ - - This section applies to distros which use /etc/network/interfaces file -to describe network interface configuration, most notably Debian and it's -derivatives. - - The ifup and ifdown commands on Debian don't support bonding out of -the box. The ifenslave-2.6 package should be installed to provide bonding -support. Once installed, this package will provide bond-* options to be used -into /etc/network/interfaces. - - Note that ifenslave-2.6 package will load the bonding module and use -the ifenslave command when appropriate. - -Example Configurations ----------------------- - -In /etc/network/interfaces, the following stanza will configure bond0, in -active-backup mode, with eth0 and eth1 as slaves. - -auto bond0 -iface bond0 inet dhcp - bond-slaves eth0 eth1 - bond-mode active-backup - bond-miimon 100 - bond-primary eth0 eth1 - -If the above configuration doesn't work, you might have a system using -upstart for system startup. This is most notably true for recent -Ubuntu versions. The following stanza in /etc/network/interfaces will -produce the same result on those systems. - -auto bond0 -iface bond0 inet dhcp - bond-slaves none - bond-mode active-backup - bond-miimon 100 - -auto eth0 -iface eth0 inet manual - bond-master bond0 - bond-primary eth0 eth1 - -auto eth1 -iface eth1 inet manual - bond-master bond0 - bond-primary eth0 eth1 - -For a full list of bond-* supported options in /etc/network/interfaces and some -more advanced examples tailored to you particular distros, see the files in -/usr/share/doc/ifenslave-2.6. - -3.6 Overriding Configuration for Special Cases ----------------------------------------------- - -When using the bonding driver, the physical port which transmits a frame is -typically selected by the bonding driver, and is not relevant to the user or -system administrator. The output port is simply selected using the policies of -the selected bonding mode. On occasion however, it is helpful to direct certain -classes of traffic to certain physical interfaces on output to implement -slightly more complex policies. For example, to reach a web server over a -bonded interface in which eth0 connects to a private network, while eth1 -connects via a public network, it may be desirous to bias the bond to send said -traffic over eth0 first, using eth1 only as a fall back, while all other traffic -can safely be sent over either interface. Such configurations may be achieved -using the traffic control utilities inherent in linux. - -By default the bonding driver is multiqueue aware and 16 queues are created -when the driver initializes (see Documentation/networking/multiqueue.txt -for details). If more or less queues are desired the module parameter -tx_queues can be used to change this value. There is no sysfs parameter -available as the allocation is done at module init time. - -The output of the file /proc/net/bonding/bondX has changed so the output Queue -ID is now printed for each slave: - -Bonding Mode: fault-tolerance (active-backup) -Primary Slave: None -Currently Active Slave: eth0 -MII Status: up -MII Polling Interval (ms): 0 -Up Delay (ms): 0 -Down Delay (ms): 0 - -Slave Interface: eth0 -MII Status: up -Link Failure Count: 0 -Permanent HW addr: 00:1a:a0:12:8f:cb -Slave queue ID: 0 - -Slave Interface: eth1 -MII Status: up -Link Failure Count: 0 -Permanent HW addr: 00:1a:a0:12:8f:cc -Slave queue ID: 2 - -The queue_id for a slave can be set using the command: - -# echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id - -Any interface that needs a queue_id set should set it with multiple calls -like the one above until proper priorities are set for all interfaces. On -distributions that allow configuration via initscripts, multiple 'queue_id' -arguments can be added to BONDING_OPTS to set all needed slave queues. - -These queue id's can be used in conjunction with the tc utility to configure -a multiqueue qdisc and filters to bias certain traffic to transmit on certain -slave devices. For instance, say we wanted, in the above configuration to -force all traffic bound to 192.168.1.100 to use eth1 in the bond as its output -device. The following commands would accomplish this: - -# tc qdisc add dev bond0 handle 1 root multiq - -# tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip dst \ - 192.168.1.100 action skbedit queue_mapping 2 - -These commands tell the kernel to attach a multiqueue queue discipline to the -bond0 interface and filter traffic enqueued to it, such that packets with a dst -ip of 192.168.1.100 have their output queue mapping value overwritten to 2. -This value is then passed into the driver, causing the normal output path -selection policy to be overridden, selecting instead qid 2, which maps to eth1. - -Note that qid values begin at 1. Qid 0 is reserved to initiate to the driver -that normal output policy selection should take place. One benefit to simply -leaving the qid for a slave to 0 is the multiqueue awareness in the bonding -driver that is now present. This awareness allows tc filters to be placed on -slave devices as well as bond devices and the bonding driver will simply act as -a pass-through for selecting output queues on the slave device rather than -output port selection. - -This feature first appeared in bonding driver version 3.7.0 and support for -output slave selection was limited to round-robin and active-backup modes. - -3.7 Configuring LACP for 802.3ad mode in a more secure way ----------------------------------------------------------- - -When using 802.3ad bonding mode, the Actor (host) and Partner (switch) -exchange LACPDUs. These LACPDUs cannot be sniffed, because they are -destined to link local mac addresses (which switches/bridges are not -supposed to forward). However, most of the values are easily predictable -or are simply the machine's MAC address (which is trivially known to all -other hosts in the same L2). This implies that other machines in the L2 -domain can spoof LACPDU packets from other hosts to the switch and potentially -cause mayhem by joining (from the point of view of the switch) another -machine's aggregate, thus receiving a portion of that hosts incoming -traffic and / or spoofing traffic from that machine themselves (potentially -even successfully terminating some portion of flows). Though this is not -a likely scenario, one could avoid this possibility by simply configuring -few bonding parameters: - - (a) ad_actor_system : You can set a random mac-address that can be used for - these LACPDU exchanges. The value can not be either NULL or Multicast. - Also it's preferable to set the local-admin bit. Following shell code - generates a random mac-address as described above. - - # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \ - $(( (RANDOM & 0xFE) | 0x02 )) \ - $(( RANDOM & 0xFF )) \ - $(( RANDOM & 0xFF )) \ - $(( RANDOM & 0xFF )) \ - $(( RANDOM & 0xFF )) \ - $(( RANDOM & 0xFF ))) - # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system - - (b) ad_actor_sys_prio : Randomize the system priority. The default value - is 65535, but system can take the value from 1 - 65535. Following shell - code generates random priority and sets it. - - # sys_prio=$(( 1 + RANDOM + RANDOM )) - # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio - - (c) ad_user_port_key : Use the user portion of the port-key. The default - keeps this empty. These are the upper 10 bits of the port-key and value - ranges from 0 - 1023. Following shell code generates these 10 bits and - sets it. - - # usr_port_key=$(( RANDOM & 0x3FF )) - # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key - - -4 Querying Bonding Configuration -================================= - -4.1 Bonding Configuration -------------------------- - - Each bonding device has a read-only file residing in the -/proc/net/bonding directory. The file contents include information -about the bonding configuration, options and state of each slave. - - For example, the contents of /proc/net/bonding/bond0 after the -driver is loaded with parameters of mode=0 and miimon=1000 is -generally as follows: - - Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004) - Bonding Mode: load balancing (round-robin) - Currently Active Slave: eth0 - MII Status: up - MII Polling Interval (ms): 1000 - Up Delay (ms): 0 - Down Delay (ms): 0 - - Slave Interface: eth1 - MII Status: up - Link Failure Count: 1 - - Slave Interface: eth0 - MII Status: up - Link Failure Count: 1 - - The precise format and contents will change depending upon the -bonding configuration, state, and version of the bonding driver. - -4.2 Network configuration -------------------------- - - The network configuration can be inspected using the ifconfig -command. Bonding devices will have the MASTER flag set; Bonding slave -devices will have the SLAVE flag set. The ifconfig output does not -contain information on which slaves are associated with which masters. - - In the example below, the bond0 interface is the master -(MASTER) while eth0 and eth1 are slaves (SLAVE). Notice all slaves of -bond0 have the same MAC address (HWaddr) as bond0 for all modes except -TLB and ALB that require a unique MAC address for each slave. - -# /sbin/ifconfig -bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 - inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 - UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 - RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0 - TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0 - collisions:0 txqueuelen:0 - -eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 - UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 - RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0 - TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0 - collisions:0 txqueuelen:100 - Interrupt:10 Base address:0x1080 - -eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 - UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 - RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0 - TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0 - collisions:0 txqueuelen:100 - Interrupt:9 Base address:0x1400 - -5. Switch Configuration -======================= - - For this section, "switch" refers to whatever system the -bonded devices are directly connected to (i.e., where the other end of -the cable plugs into). This may be an actual dedicated switch device, -or it may be another regular system (e.g., another computer running -Linux), - - The active-backup, balance-tlb and balance-alb modes do not -require any specific configuration of the switch. - - The 802.3ad mode requires that the switch have the appropriate -ports configured as an 802.3ad aggregation. The precise method used -to configure this varies from switch to switch, but, for example, a -Cisco 3550 series switch requires that the appropriate ports first be -grouped together in a single etherchannel instance, then that -etherchannel is set to mode "lacp" to enable 802.3ad (instead of -standard EtherChannel). - - The balance-rr, balance-xor and broadcast modes generally -require that the switch have the appropriate ports grouped together. -The nomenclature for such a group differs between switches, it may be -called an "etherchannel" (as in the Cisco example, above), a "trunk -group" or some other similar variation. For these modes, each switch -will also have its own configuration options for the switch's transmit -policy to the bond. Typical choices include XOR of either the MAC or -IP addresses. The transmit policy of the two peers does not need to -match. For these three modes, the bonding mode really selects a -transmit policy for an EtherChannel group; all three will interoperate -with another EtherChannel group. - - -6. 802.1q VLAN Support -====================== - - It is possible to configure VLAN devices over a bond interface -using the 8021q driver. However, only packets coming from the 8021q -driver and passing through bonding will be tagged by default. Self -generated packets, for example, bonding's learning packets or ARP -packets generated by either ALB mode or the ARP monitor mechanism, are -tagged internally by bonding itself. As a result, bonding must -"learn" the VLAN IDs configured above it, and use those IDs to tag -self generated packets. - - For reasons of simplicity, and to support the use of adapters -that can do VLAN hardware acceleration offloading, the bonding -interface declares itself as fully hardware offloading capable, it gets -the add_vid/kill_vid notifications to gather the necessary -information, and it propagates those actions to the slaves. In case -of mixed adapter types, hardware accelerated tagged packets that -should go through an adapter that is not offloading capable are -"un-accelerated" by the bonding driver so the VLAN tag sits in the -regular location. - - VLAN interfaces *must* be added on top of a bonding interface -only after enslaving at least one slave. The bonding interface has a -hardware address of 00:00:00:00:00:00 until the first slave is added. -If the VLAN interface is created prior to the first enslavement, it -would pick up the all-zeroes hardware address. Once the first slave -is attached to the bond, the bond device itself will pick up the -slave's hardware address, which is then available for the VLAN device. - - Also, be aware that a similar problem can occur if all slaves -are released from a bond that still has one or more VLAN interfaces on -top of it. When a new slave is added, the bonding interface will -obtain its hardware address from the first slave, which might not -match the hardware address of the VLAN interfaces (which was -ultimately copied from an earlier slave). - - There are two methods to insure that the VLAN device operates -with the correct hardware address if all slaves are removed from a -bond interface: - - 1. Remove all VLAN interfaces then recreate them - - 2. Set the bonding interface's hardware address so that it -matches the hardware address of the VLAN interfaces. - - Note that changing a VLAN interface's HW address would set the -underlying device -- i.e. the bonding interface -- to promiscuous -mode, which might not be what you want. - - -7. Link Monitoring -================== - - The bonding driver at present supports two schemes for -monitoring a slave device's link state: the ARP monitor and the MII -monitor. - - At the present time, due to implementation restrictions in the -bonding driver itself, it is not possible to enable both ARP and MII -monitoring simultaneously. - -7.1 ARP Monitor Operation -------------------------- - - The ARP monitor operates as its name suggests: it sends ARP -queries to one or more designated peer systems on the network, and -uses the response as an indication that the link is operating. This -gives some assurance that traffic is actually flowing to and from one -or more peers on the local network. - - The ARP monitor relies on the device driver itself to verify -that traffic is flowing. In particular, the driver must keep up to -date the last receive time, dev->last_rx. Drivers that use NETIF_F_LLTX -flag must also update netdev_queue->trans_start. If they do not, then the -ARP monitor will immediately fail any slaves using that driver, and -those slaves will stay down. If networking monitoring (tcpdump, etc) -shows the ARP requests and replies on the network, then it may be that -your device driver is not updating last_rx and trans_start. - -7.2 Configuring Multiple ARP Targets ------------------------------------- - - While ARP monitoring can be done with just one target, it can -be useful in a High Availability setup to have several targets to -monitor. In the case of just one target, the target itself may go -down or have a problem making it unresponsive to ARP requests. Having -an additional target (or several) increases the reliability of the ARP -monitoring. - - Multiple ARP targets must be separated by commas as follows: - -# example options for ARP monitoring with three targets -alias bond0 bonding -options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9 - - For just a single target the options would resemble: - -# example options for ARP monitoring with one target -alias bond0 bonding -options bond0 arp_interval=60 arp_ip_target=192.168.0.100 - - -7.3 MII Monitor Operation -------------------------- - - The MII monitor monitors only the carrier state of the local -network interface. It accomplishes this in one of three ways: by -depending upon the device driver to maintain its carrier state, by -querying the device's MII registers, or by making an ethtool query to -the device. - - If the use_carrier module parameter is 1 (the default value), -then the MII monitor will rely on the driver for carrier state -information (via the netif_carrier subsystem). As explained in the -use_carrier parameter information, above, if the MII monitor fails to -detect carrier loss on the device (e.g., when the cable is physically -disconnected), it may be that the driver does not support -netif_carrier. - - If use_carrier is 0, then the MII monitor will first query the -device's (via ioctl) MII registers and check the link state. If that -request fails (not just that it returns carrier down), then the MII -monitor will make an ethtool ETHOOL_GLINK request to attempt to obtain -the same information. If both methods fail (i.e., the driver either -does not support or had some error in processing both the MII register -and ethtool requests), then the MII monitor will assume the link is -up. - -8. Potential Sources of Trouble -=============================== - -8.1 Adventures in Routing -------------------------- - - When bonding is configured, it is important that the slave -devices not have routes that supersede routes of the master (or, -generally, not have routes at all). For example, suppose the bonding -device bond0 has two slaves, eth0 and eth1, and the routing table is -as follows: - -Kernel IP routing table -Destination Gateway Genmask Flags MSS Window irtt Iface -10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0 -10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1 -10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0 -127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo - - This routing configuration will likely still update the -receive/transmit times in the driver (needed by the ARP monitor), but -may bypass the bonding driver (because outgoing traffic to, in this -case, another host on network 10 would use eth0 or eth1 before bond0). - - The ARP monitor (and ARP itself) may become confused by this -configuration, because ARP requests (generated by the ARP monitor) -will be sent on one interface (bond0), but the corresponding reply -will arrive on a different interface (eth0). This reply looks to ARP -as an unsolicited ARP reply (because ARP matches replies on an -interface basis), and is discarded. The MII monitor is not affected -by the state of the routing table. - - The solution here is simply to insure that slaves do not have -routes of their own, and if for some reason they must, those routes do -not supersede routes of their master. This should generally be the -case, but unusual configurations or errant manual or automatic static -route additions may cause trouble. - -8.2 Ethernet Device Renaming ----------------------------- - - On systems with network configuration scripts that do not -associate physical devices directly with network interface names (so -that the same physical device always has the same "ethX" name), it may -be necessary to add some special logic to config files in -/etc/modprobe.d/. - - For example, given a modules.conf containing the following: - -alias bond0 bonding -options bond0 mode=some-mode miimon=50 -alias eth0 tg3 -alias eth1 tg3 -alias eth2 e1000 -alias eth3 e1000 - - If neither eth0 and eth1 are slaves to bond0, then when the -bond0 interface comes up, the devices may end up reordered. This -happens because bonding is loaded first, then its slave device's -drivers are loaded next. Since no other drivers have been loaded, -when the e1000 driver loads, it will receive eth0 and eth1 for its -devices, but the bonding configuration tries to enslave eth2 and eth3 -(which may later be assigned to the tg3 devices). - - Adding the following: - -add above bonding e1000 tg3 - - causes modprobe to load e1000 then tg3, in that order, when -bonding is loaded. This command is fully documented in the -modules.conf manual page. - - On systems utilizing modprobe an equivalent problem can occur. -In this case, the following can be added to config files in -/etc/modprobe.d/ as: - -softdep bonding pre: tg3 e1000 - - This will load tg3 and e1000 modules before loading the bonding one. -Full documentation on this can be found in the modprobe.d and modprobe -manual pages. - -8.3. Painfully Slow Or No Failed Link Detection By Miimon ---------------------------------------------------------- - - By default, bonding enables the use_carrier option, which -instructs bonding to trust the driver to maintain carrier state. - - As discussed in the options section, above, some drivers do -not support the netif_carrier_on/_off link state tracking system. -With use_carrier enabled, bonding will always see these links as up, -regardless of their actual state. - - Additionally, other drivers do support netif_carrier, but do -not maintain it in real time, e.g., only polling the link state at -some fixed interval. In this case, miimon will detect failures, but -only after some long period of time has expired. If it appears that -miimon is very slow in detecting link failures, try specifying -use_carrier=0 to see if that improves the failure detection time. If -it does, then it may be that the driver checks the carrier state at a -fixed interval, but does not cache the MII register values (so the -use_carrier=0 method of querying the registers directly works). If -use_carrier=0 does not improve the failover, then the driver may cache -the registers, or the problem may be elsewhere. - - Also, remember that miimon only checks for the device's -carrier state. It has no way to determine the state of devices on or -beyond other ports of a switch, or if a switch is refusing to pass -traffic while still maintaining carrier on. - -9. SNMP agents -=============== - - If running SNMP agents, the bonding driver should be loaded -before any network drivers participating in a bond. This requirement -is due to the interface index (ipAdEntIfIndex) being associated to -the first interface found with a given IP address. That is, there is -only one ipAdEntIfIndex for each IP address. For example, if eth0 and -eth1 are slaves of bond0 and the driver for eth0 is loaded before the -bonding driver, the interface for the IP address will be associated -with the eth0 interface. This configuration is shown below, the IP -address 192.168.1.1 has an interface index of 2 which indexes to eth0 -in the ifDescr table (ifDescr.2). - - interfaces.ifTable.ifEntry.ifDescr.1 = lo - interfaces.ifTable.ifEntry.ifDescr.2 = eth0 - interfaces.ifTable.ifEntry.ifDescr.3 = eth1 - interfaces.ifTable.ifEntry.ifDescr.4 = eth2 - interfaces.ifTable.ifEntry.ifDescr.5 = eth3 - interfaces.ifTable.ifEntry.ifDescr.6 = bond0 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 - - This problem is avoided by loading the bonding driver before -any network drivers participating in a bond. Below is an example of -loading the bonding driver first, the IP address 192.168.1.1 is -correctly associated with ifDescr.2. - - interfaces.ifTable.ifEntry.ifDescr.1 = lo - interfaces.ifTable.ifEntry.ifDescr.2 = bond0 - interfaces.ifTable.ifEntry.ifDescr.3 = eth0 - interfaces.ifTable.ifEntry.ifDescr.4 = eth1 - interfaces.ifTable.ifEntry.ifDescr.5 = eth2 - interfaces.ifTable.ifEntry.ifDescr.6 = eth3 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5 - ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 - - While some distributions may not report the interface name in -ifDescr, the association between the IP address and IfIndex remains -and SNMP functions such as Interface_Scan_Next will report that -association. - -10. Promiscuous mode -==================== - - When running network monitoring tools, e.g., tcpdump, it is -common to enable promiscuous mode on the device, so that all traffic -is seen (instead of seeing only traffic destined for the local host). -The bonding driver handles promiscuous mode changes to the bonding -master device (e.g., bond0), and propagates the setting to the slave -devices. - - For the balance-rr, balance-xor, broadcast, and 802.3ad modes, -the promiscuous mode setting is propagated to all slaves. - - For the active-backup, balance-tlb and balance-alb modes, the -promiscuous mode setting is propagated only to the active slave. - - For balance-tlb mode, the active slave is the slave currently -receiving inbound traffic. - - For balance-alb mode, the active slave is the slave used as a -"primary." This slave is used for mode-specific control traffic, for -sending to peers that are unassigned or if the load is unbalanced. - - For the active-backup, balance-tlb and balance-alb modes, when -the active slave changes (e.g., due to a link failure), the -promiscuous setting will be propagated to the new active slave. - -11. Configuring Bonding for High Availability -============================================= - - High Availability refers to configurations that provide -maximum network availability by having redundant or backup devices, -links or switches between the host and the rest of the world. The -goal is to provide the maximum availability of network connectivity -(i.e., the network always works), even though other configurations -could provide higher throughput. - -11.1 High Availability in a Single Switch Topology --------------------------------------------------- - - If two hosts (or a host and a single switch) are directly -connected via multiple physical links, then there is no availability -penalty to optimizing for maximum bandwidth. In this case, there is -only one switch (or peer), so if it fails, there is no alternative -access to fail over to. Additionally, the bonding load balance modes -support link monitoring of their members, so if individual links fail, -the load will be rebalanced across the remaining devices. - - See Section 12, "Configuring Bonding for Maximum Throughput" -for information on configuring bonding with one peer device. - -11.2 High Availability in a Multiple Switch Topology ----------------------------------------------------- - - With multiple switches, the configuration of bonding and the -network changes dramatically. In multiple switch topologies, there is -a trade off between network availability and usable bandwidth. - - Below is a sample network, configured to maximize the -availability of the network: - - | | - |port3 port3| - +-----+----+ +-----+----+ - | |port2 ISL port2| | - | switch A +--------------------------+ switch B | - | | | | - +-----+----+ +-----++---+ - |port1 port1| - | +-------+ | - +-------------+ host1 +---------------+ - eth0 +-------+ eth1 - - In this configuration, there is a link between the two -switches (ISL, or inter switch link), and multiple ports connecting to -the outside world ("port3" on each switch). There is no technical -reason that this could not be extended to a third switch. - -11.2.1 HA Bonding Mode Selection for Multiple Switch Topology -------------------------------------------------------------- - - In a topology such as the example above, the active-backup and -broadcast modes are the only useful bonding modes when optimizing for -availability; the other modes require all links to terminate on the -same peer for them to behave rationally. - -active-backup: This is generally the preferred mode, particularly if - the switches have an ISL and play together well. If the - network configuration is such that one switch is specifically - a backup switch (e.g., has lower capacity, higher cost, etc), - then the primary option can be used to insure that the - preferred link is always used when it is available. - -broadcast: This mode is really a special purpose mode, and is suitable - only for very specific needs. For example, if the two - switches are not connected (no ISL), and the networks beyond - them are totally independent. In this case, if it is - necessary for some specific one-way traffic to reach both - independent networks, then the broadcast mode may be suitable. - -11.2.2 HA Link Monitoring Selection for Multiple Switch Topology ----------------------------------------------------------------- - - The choice of link monitoring ultimately depends upon your -switch. If the switch can reliably fail ports in response to other -failures, then either the MII or ARP monitors should work. For -example, in the above example, if the "port3" link fails at the remote -end, the MII monitor has no direct means to detect this. The ARP -monitor could be configured with a target at the remote end of port3, -thus detecting that failure without switch support. - - In general, however, in a multiple switch topology, the ARP -monitor can provide a higher level of reliability in detecting end to -end connectivity failures (which may be caused by the failure of any -individual component to pass traffic for any reason). Additionally, -the ARP monitor should be configured with multiple targets (at least -one for each switch in the network). This will insure that, -regardless of which switch is active, the ARP monitor has a suitable -target to query. - - Note, also, that of late many switches now support a functionality -generally referred to as "trunk failover." This is a feature of the -switch that causes the link state of a particular switch port to be set -down (or up) when the state of another switch port goes down (or up). -Its purpose is to propagate link failures from logically "exterior" ports -to the logically "interior" ports that bonding is able to monitor via -miimon. Availability and configuration for trunk failover varies by -switch, but this can be a viable alternative to the ARP monitor when using -suitable switches. - -12. Configuring Bonding for Maximum Throughput -============================================== - -12.1 Maximizing Throughput in a Single Switch Topology ------------------------------------------------------- - - In a single switch configuration, the best method to maximize -throughput depends upon the application and network environment. The -various load balancing modes each have strengths and weaknesses in -different environments, as detailed below. - - For this discussion, we will break down the topologies into -two categories. Depending upon the destination of most traffic, we -categorize them into either "gatewayed" or "local" configurations. - - In a gatewayed configuration, the "switch" is acting primarily -as a router, and the majority of traffic passes through this router to -other networks. An example would be the following: - - - +----------+ +----------+ - | |eth0 port1| | to other networks - | Host A +---------------------+ router +-------------------> - | +---------------------+ | Hosts B and C are out - | |eth1 port2| | here somewhere - +----------+ +----------+ - - The router may be a dedicated router device, or another host -acting as a gateway. For our discussion, the important point is that -the majority of traffic from Host A will pass through the router to -some other network before reaching its final destination. - - In a gatewayed network configuration, although Host A may -communicate with many other systems, all of its traffic will be sent -and received via one other peer on the local network, the router. - - Note that the case of two systems connected directly via -multiple physical links is, for purposes of configuring bonding, the -same as a gatewayed configuration. In that case, it happens that all -traffic is destined for the "gateway" itself, not some other network -beyond the gateway. - - In a local configuration, the "switch" is acting primarily as -a switch, and the majority of traffic passes through this switch to -reach other stations on the same network. An example would be the -following: - - +----------+ +----------+ +--------+ - | |eth0 port1| +-------+ Host B | - | Host A +------------+ switch |port3 +--------+ - | +------------+ | +--------+ - | |eth1 port2| +------------------+ Host C | - +----------+ +----------+port4 +--------+ - - - Again, the switch may be a dedicated switch device, or another -host acting as a gateway. For our discussion, the important point is -that the majority of traffic from Host A is destined for other hosts -on the same local network (Hosts B and C in the above example). - - In summary, in a gatewayed configuration, traffic to and from -the bonded device will be to the same MAC level peer on the network -(the gateway itself, i.e., the router), regardless of its final -destination. In a local configuration, traffic flows directly to and -from the final destinations, thus, each destination (Host B, Host C) -will be addressed directly by their individual MAC addresses. - - This distinction between a gatewayed and a local network -configuration is important because many of the load balancing modes -available use the MAC addresses of the local network source and -destination to make load balancing decisions. The behavior of each -mode is described below. - - -12.1.1 MT Bonding Mode Selection for Single Switch Topology ------------------------------------------------------------ - - This configuration is the easiest to set up and to understand, -although you will have to decide which bonding mode best suits your -needs. The trade offs for each mode are detailed below: - -balance-rr: This mode is the only mode that will permit a single - TCP/IP connection to stripe traffic across multiple - interfaces. It is therefore the only mode that will allow a - single TCP/IP stream to utilize more than one interface's - worth of throughput. This comes at a cost, however: the - striping generally results in peer systems receiving packets out - of order, causing TCP/IP's congestion control system to kick - in, often by retransmitting segments. - - It is possible to adjust TCP/IP's congestion limits by - altering the net.ipv4.tcp_reordering sysctl parameter. The - usual default value is 3. But keep in mind TCP stack is able - to automatically increase this when it detects reorders. - - Note that the fraction of packets that will be delivered out of - order is highly variable, and is unlikely to be zero. The level - of reordering depends upon a variety of factors, including the - networking interfaces, the switch, and the topology of the - configuration. Speaking in general terms, higher speed network - cards produce more reordering (due to factors such as packet - coalescing), and a "many to many" topology will reorder at a - higher rate than a "many slow to one fast" configuration. - - Many switches do not support any modes that stripe traffic - (instead choosing a port based upon IP or MAC level addresses); - for those devices, traffic for a particular connection flowing - through the switch to a balance-rr bond will not utilize greater - than one interface's worth of bandwidth. - - If you are utilizing protocols other than TCP/IP, UDP for - example, and your application can tolerate out of order - delivery, then this mode can allow for single stream datagram - performance that scales near linearly as interfaces are added - to the bond. - - This mode requires the switch to have the appropriate ports - configured for "etherchannel" or "trunking." - -active-backup: There is not much advantage in this network topology to - the active-backup mode, as the inactive backup devices are all - connected to the same peer as the primary. In this case, a - load balancing mode (with link monitoring) will provide the - same level of network availability, but with increased - available bandwidth. On the plus side, active-backup mode - does not require any configuration of the switch, so it may - have value if the hardware available does not support any of - the load balance modes. - -balance-xor: This mode will limit traffic such that packets destined - for specific peers will always be sent over the same - interface. Since the destination is determined by the MAC - addresses involved, this mode works best in a "local" network - configuration (as described above), with destinations all on - the same local network. This mode is likely to be suboptimal - if all your traffic is passed through a single router (i.e., a - "gatewayed" network configuration, as described above). - - As with balance-rr, the switch ports need to be configured for - "etherchannel" or "trunking." - -broadcast: Like active-backup, there is not much advantage to this - mode in this type of network topology. - -802.3ad: This mode can be a good choice for this type of network - topology. The 802.3ad mode is an IEEE standard, so all peers - that implement 802.3ad should interoperate well. The 802.3ad - protocol includes automatic configuration of the aggregates, - so minimal manual configuration of the switch is needed - (typically only to designate that some set of devices is - available for 802.3ad). The 802.3ad standard also mandates - that frames be delivered in order (within certain limits), so - in general single connections will not see misordering of - packets. The 802.3ad mode does have some drawbacks: the - standard mandates that all devices in the aggregate operate at - the same speed and duplex. Also, as with all bonding load - balance modes other than balance-rr, no single connection will - be able to utilize more than a single interface's worth of - bandwidth. - - Additionally, the linux bonding 802.3ad implementation - distributes traffic by peer (using an XOR of MAC addresses - and packet type ID), so in a "gatewayed" configuration, all - outgoing traffic will generally use the same device. Incoming - traffic may also end up on a single device, but that is - dependent upon the balancing policy of the peer's 802.3ad - implementation. In a "local" configuration, traffic will be - distributed across the devices in the bond. - - Finally, the 802.3ad mode mandates the use of the MII monitor, - therefore, the ARP monitor is not available in this mode. - -balance-tlb: The balance-tlb mode balances outgoing traffic by peer. - Since the balancing is done according to MAC address, in a - "gatewayed" configuration (as described above), this mode will - send all traffic across a single device. However, in a - "local" network configuration, this mode balances multiple - local network peers across devices in a vaguely intelligent - manner (not a simple XOR as in balance-xor or 802.3ad mode), - so that mathematically unlucky MAC addresses (i.e., ones that - XOR to the same value) will not all "bunch up" on a single - interface. - - Unlike 802.3ad, interfaces may be of differing speeds, and no - special switch configuration is required. On the down side, - in this mode all incoming traffic arrives over a single - interface, this mode requires certain ethtool support in the - network device driver of the slave interfaces, and the ARP - monitor is not available. - -balance-alb: This mode is everything that balance-tlb is, and more. - It has all of the features (and restrictions) of balance-tlb, - and will also balance incoming traffic from local network - peers (as described in the Bonding Module Options section, - above). - - The only additional down side to this mode is that the network - device driver must support changing the hardware address while - the device is open. - -12.1.2 MT Link Monitoring for Single Switch Topology ----------------------------------------------------- - - The choice of link monitoring may largely depend upon which -mode you choose to use. The more advanced load balancing modes do not -support the use of the ARP monitor, and are thus restricted to using -the MII monitor (which does not provide as high a level of end to end -assurance as the ARP monitor). - -12.2 Maximum Throughput in a Multiple Switch Topology ------------------------------------------------------ - - Multiple switches may be utilized to optimize for throughput -when they are configured in parallel as part of an isolated network -between two or more systems, for example: - - +-----------+ - | Host A | - +-+---+---+-+ - | | | - +--------+ | +---------+ - | | | - +------+---+ +-----+----+ +-----+----+ - | Switch A | | Switch B | | Switch C | - +------+---+ +-----+----+ +-----+----+ - | | | - +--------+ | +---------+ - | | | - +-+---+---+-+ - | Host B | - +-----------+ - - In this configuration, the switches are isolated from one -another. One reason to employ a topology such as this is for an -isolated network with many hosts (a cluster configured for high -performance, for example), using multiple smaller switches can be more -cost effective than a single larger switch, e.g., on a network with 24 -hosts, three 24 port switches can be significantly less expensive than -a single 72 port switch. - - If access beyond the network is required, an individual host -can be equipped with an additional network device connected to an -external network; this host then additionally acts as a gateway. - -12.2.1 MT Bonding Mode Selection for Multiple Switch Topology -------------------------------------------------------------- - - In actual practice, the bonding mode typically employed in -configurations of this type is balance-rr. Historically, in this -network configuration, the usual caveats about out of order packet -delivery are mitigated by the use of network adapters that do not do -any kind of packet coalescing (via the use of NAPI, or because the -device itself does not generate interrupts until some number of -packets has arrived). When employed in this fashion, the balance-rr -mode allows individual connections between two hosts to effectively -utilize greater than one interface's bandwidth. - -12.2.2 MT Link Monitoring for Multiple Switch Topology ------------------------------------------------------- - - Again, in actual practice, the MII monitor is most often used -in this configuration, as performance is given preference over -availability. The ARP monitor will function in this topology, but its -advantages over the MII monitor are mitigated by the volume of probes -needed as the number of systems involved grows (remember that each -host in the network is configured with bonding). - -13. Switch Behavior Issues -========================== - -13.1 Link Establishment and Failover Delays -------------------------------------------- - - Some switches exhibit undesirable behavior with regard to the -timing of link up and down reporting by the switch. - - First, when a link comes up, some switches may indicate that -the link is up (carrier available), but not pass traffic over the -interface for some period of time. This delay is typically due to -some type of autonegotiation or routing protocol, but may also occur -during switch initialization (e.g., during recovery after a switch -failure). If you find this to be a problem, specify an appropriate -value to the updelay bonding module option to delay the use of the -relevant interface(s). - - Second, some switches may "bounce" the link state one or more -times while a link is changing state. This occurs most commonly while -the switch is initializing. Again, an appropriate updelay value may -help. - - Note that when a bonding interface has no active links, the -driver will immediately reuse the first link that goes up, even if the -updelay parameter has been specified (the updelay is ignored in this -case). If there are slave interfaces waiting for the updelay timeout -to expire, the interface that first went into that state will be -immediately reused. This reduces down time of the network if the -value of updelay has been overestimated, and since this occurs only in -cases with no connectivity, there is no additional penalty for -ignoring the updelay. - - In addition to the concerns about switch timings, if your -switches take a long time to go into backup mode, it may be desirable -to not activate a backup interface immediately after a link goes down. -Failover may be delayed via the downdelay bonding module option. - -13.2 Duplicated Incoming Packets --------------------------------- - - NOTE: Starting with version 3.0.2, the bonding driver has logic to -suppress duplicate packets, which should largely eliminate this problem. -The following description is kept for reference. - - It is not uncommon to observe a short burst of duplicated -traffic when the bonding device is first used, or after it has been -idle for some period of time. This is most easily observed by issuing -a "ping" to some other host on the network, and noticing that the -output from ping flags duplicates (typically one per slave). - - For example, on a bond in active-backup mode with five slaves -all connected to one switch, the output may appear as follows: - -# ping -n 10.0.4.2 -PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data. -64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms -64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) -64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) -64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) -64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) -64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms -64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms -64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms - - This is not due to an error in the bonding driver, rather, it -is a side effect of how many switches update their MAC forwarding -tables. Initially, the switch does not associate the MAC address in -the packet with a particular switch port, and so it may send the -traffic to all ports until its MAC forwarding table is updated. Since -the interfaces attached to the bond may occupy multiple ports on a -single switch, when the switch (temporarily) floods the traffic to all -ports, the bond device receives multiple copies of the same packet -(one per slave device). - - The duplicated packet behavior is switch dependent, some -switches exhibit this, and some do not. On switches that display this -behavior, it can be induced by clearing the MAC forwarding table (on -most Cisco switches, the privileged command "clear mac address-table -dynamic" will accomplish this). - -14. Hardware Specific Considerations -==================================== - - This section contains additional information for configuring -bonding on specific hardware platforms, or for interfacing bonding -with particular switches or other devices. - -14.1 IBM BladeCenter --------------------- - - This applies to the JS20 and similar systems. - - On the JS20 blades, the bonding driver supports only -balance-rr, active-backup, balance-tlb and balance-alb modes. This is -largely due to the network topology inside the BladeCenter, detailed -below. - -JS20 network adapter information --------------------------------- - - All JS20s come with two Broadcom Gigabit Ethernet ports -integrated on the planar (that's "motherboard" in IBM-speak). In the -BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to -I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2. -An add-on Broadcom daughter card can be installed on a JS20 to provide -two more Gigabit Ethernet ports. These ports, eth2 and eth3, are -wired to I/O Modules 3 and 4, respectively. - - Each I/O Module may contain either a switch or a passthrough -module (which allows ports to be directly connected to an external -switch). Some bonding modes require a specific BladeCenter internal -network topology in order to function; these are detailed below. - - Additional BladeCenter-specific networking information can be -found in two IBM Redbooks (www.ibm.com/redbooks): - -"IBM eServer BladeCenter Networking Options" -"IBM eServer BladeCenter Layer 2-7 Network Switching" - -BladeCenter networking configuration ------------------------------------- - - Because a BladeCenter can be configured in a very large number -of ways, this discussion will be confined to describing basic -configurations. - - Normally, Ethernet Switch Modules (ESMs) are used in I/O -modules 1 and 2. In this configuration, the eth0 and eth1 ports of a -JS20 will be connected to different internal switches (in the -respective I/O modules). - - A passthrough module (OPM or CPM, optical or copper, -passthrough module) connects the I/O module directly to an external -switch. By using PMs in I/O module #1 and #2, the eth0 and eth1 -interfaces of a JS20 can be redirected to the outside world and -connected to a common external switch. - - Depending upon the mix of ESMs and PMs, the network will -appear to bonding as either a single switch topology (all PMs) or as a -multiple switch topology (one or more ESMs, zero or more PMs). It is -also possible to connect ESMs together, resulting in a configuration -much like the example in "High Availability in a Multiple Switch -Topology," above. - -Requirements for specific modes -------------------------------- - - The balance-rr mode requires the use of passthrough modules -for devices in the bond, all connected to an common external switch. -That switch must be configured for "etherchannel" or "trunking" on the -appropriate ports, as is usual for balance-rr. - - The balance-alb and balance-tlb modes will function with -either switch modules or passthrough modules (or a mix). The only -specific requirement for these modes is that all network interfaces -must be able to reach all destinations for traffic sent over the -bonding device (i.e., the network must converge at some point outside -the BladeCenter). - - The active-backup mode has no additional requirements. - -Link monitoring issues ----------------------- - - When an Ethernet Switch Module is in place, only the ARP -monitor will reliably detect link loss to an external switch. This is -nothing unusual, but examination of the BladeCenter cabinet would -suggest that the "external" network ports are the ethernet ports for -the system, when it fact there is a switch between these "external" -ports and the devices on the JS20 system itself. The MII monitor is -only able to detect link failures between the ESM and the JS20 system. - - When a passthrough module is in place, the MII monitor does -detect failures to the "external" port, which is then directly -connected to the JS20 system. - -Other concerns --------------- - - The Serial Over LAN (SoL) link is established over the primary -ethernet (eth0) only, therefore, any loss of link to eth0 will result -in losing your SoL connection. It will not fail over with other -network traffic, as the SoL system is beyond the control of the -bonding driver. - - It may be desirable to disable spanning tree on the switch -(either the internal Ethernet Switch Module, or an external switch) to -avoid fail-over delay issues when using bonding. - - -15. Frequently Asked Questions -============================== - -1. Is it SMP safe? - - Yes. The old 2.0.xx channel bonding patch was not SMP safe. -The new driver was designed to be SMP safe from the start. - -2. What type of cards will work with it? - - Any Ethernet type cards (you can even mix cards - a Intel -EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes, -devices need not be of the same speed. - - Starting with version 3.2.1, bonding also supports Infiniband -slaves in active-backup mode. - -3. How many bonding devices can I have? - - There is no limit. - -4. How many slaves can a bonding device have? - - This is limited only by the number of network interfaces Linux -supports and/or the number of network cards you can place in your -system. - -5. What happens when a slave link dies? - - If link monitoring is enabled, then the failing device will be -disabled. The active-backup mode will fail over to a backup link, and -other modes will ignore the failed link. The link will continue to be -monitored, and should it recover, it will rejoin the bond (in whatever -manner is appropriate for the mode). See the sections on High -Availability and the documentation for each mode for additional -information. - - Link monitoring can be enabled via either the miimon or -arp_interval parameters (described in the module parameters section, -above). In general, miimon monitors the carrier state as sensed by -the underlying network device, and the arp monitor (arp_interval) -monitors connectivity to another host on the local network. - - If no link monitoring is configured, the bonding driver will -be unable to detect link failures, and will assume that all links are -always available. This will likely result in lost packets, and a -resulting degradation of performance. The precise performance loss -depends upon the bonding mode and network configuration. - -6. Can bonding be used for High Availability? - - Yes. See the section on High Availability for details. - -7. Which switches/systems does it work with? - - The full answer to this depends upon the desired mode. - - In the basic balance modes (balance-rr and balance-xor), it -works with any system that supports etherchannel (also called -trunking). Most managed switches currently available have such -support, and many unmanaged switches as well. - - The advanced balance modes (balance-tlb and balance-alb) do -not have special switch requirements, but do need device drivers that -support specific features (described in the appropriate section under -module parameters, above). - - In 802.3ad mode, it works with systems that support IEEE -802.3ad Dynamic Link Aggregation. Most managed and many unmanaged -switches currently available support 802.3ad. - - The active-backup mode should work with any Layer-II switch. - -8. Where does a bonding device get its MAC address from? - - When using slave devices that have fixed MAC addresses, or when -the fail_over_mac option is enabled, the bonding device's MAC address is -the MAC address of the active slave. - - For other configurations, if not explicitly configured (with -ifconfig or ip link), the MAC address of the bonding device is taken from -its first slave device. This MAC address is then passed to all following -slaves and remains persistent (even if the first slave is removed) until -the bonding device is brought down or reconfigured. - - If you wish to change the MAC address, you can set it with -ifconfig or ip link: - -# ifconfig bond0 hw ether 00:11:22:33:44:55 - -# ip link set bond0 address 66:77:88:99:aa:bb - - The MAC address can be also changed by bringing down/up the -device and then changing its slaves (or their order): - -# ifconfig bond0 down ; modprobe -r bonding -# ifconfig bond0 .... up -# ifenslave bond0 eth... - - This method will automatically take the address from the next -slave that is added. - - To restore your slaves' MAC addresses, you need to detach them -from the bond (`ifenslave -d bond0 eth0'). The bonding driver will -then restore the MAC addresses that the slaves had before they were -enslaved. - -16. Resources and Links -======================= - - The latest version of the bonding driver can be found in the latest -version of the linux kernel, found on http://kernel.org - - The latest version of this document can be found in the latest kernel -source (named Documentation/networking/bonding.txt). - - Discussions regarding the usage of the bonding driver take place on the -bonding-devel mailing list, hosted at sourceforge.net. If you have questions or -problems, post them to the list. The list address is: - -bonding-devel@lists.sourceforge.net - - The administrative interface (to subscribe or unsubscribe) can -be found at: - -https://lists.sourceforge.net/lists/listinfo/bonding-devel - - Discussions regarding the development of the bonding driver take place -on the main Linux network mailing list, hosted at vger.kernel.org. The list -address is: - -netdev@vger.kernel.org - - The administrative interface (to subscribe or unsubscribe) can -be found at: - -http://vger.kernel.org/vger-lists.html#netdev - -Donald Becker's Ethernet Drivers and diag programs may be found at : - - http://web.archive.org/web/*/http://www.scyld.com/network/ - -You will also find a lot of information regarding Ethernet, NWay, MII, -etc. at www.scyld.com. - --- END -- diff --git a/Documentation/networking/device_drivers/intel/e100.rst b/Documentation/networking/device_drivers/intel/e100.rst index caf023cc88de..3ac21e7119a7 100644 --- a/Documentation/networking/device_drivers/intel/e100.rst +++ b/Documentation/networking/device_drivers/intel/e100.rst @@ -33,7 +33,7 @@ The following features are now available in supported kernels: - SNMP Channel Bonding documentation can be found in the Linux kernel source: -/Documentation/networking/bonding.txt +/Documentation/networking/bonding.rst Identifying Your Adapter diff --git a/Documentation/networking/device_drivers/intel/ixgb.rst b/Documentation/networking/device_drivers/intel/ixgb.rst index 945018207a92..ab624f1a44a8 100644 --- a/Documentation/networking/device_drivers/intel/ixgb.rst +++ b/Documentation/networking/device_drivers/intel/ixgb.rst @@ -37,7 +37,7 @@ The following features are available in this kernel: - SNMP Channel Bonding documentation can be found in the Linux kernel source: -/Documentation/networking/bonding.txt +/Documentation/networking/bonding.rst The driver information previously displayed in the /proc filesystem is not supported in this release. Alternatively, you can use ethtool (version 1.6 diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index fbf845fbaff7..22b872834ef0 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -44,6 +44,7 @@ Contents: atm ax25 baycom + bonding .. only:: subproject and html diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index b103fbdd0f68..4ab6d343fd86 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -50,7 +50,7 @@ config BONDING The driver supports multiple bonding modes to allow for both high performance and high availability operation. - Refer to for more + Refer to for more information. To compile this driver as a module, choose M here: the module -- cgit v1.2.3 From 92f06f4226fd9bdd6fbbd2e8b84601fc14b5855e Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:25 +0200 Subject: docs: networking: convert cdc_mbim.txt to ReST - add SPDX header; - mark code blocks and literals as such; - use :field: markup; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/cdc_mbim.rst | 355 ++++++++++++++++++++++++++++++++++ Documentation/networking/cdc_mbim.txt | 339 -------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 356 insertions(+), 339 deletions(-) create mode 100644 Documentation/networking/cdc_mbim.rst delete mode 100644 Documentation/networking/cdc_mbim.txt (limited to 'Documentation') diff --git a/Documentation/networking/cdc_mbim.rst b/Documentation/networking/cdc_mbim.rst new file mode 100644 index 000000000000..0048409c06b4 --- /dev/null +++ b/Documentation/networking/cdc_mbim.rst @@ -0,0 +1,355 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================================== +cdc_mbim - Driver for CDC MBIM Mobile Broadband modems +====================================================== + +The cdc_mbim driver supports USB devices conforming to the "Universal +Serial Bus Communications Class Subclass Specification for Mobile +Broadband Interface Model" [1], which is a further development of +"Universal Serial Bus Communications Class Subclass Specifications for +Network Control Model Devices" [2] optimized for Mobile Broadband +devices, aka "3G/LTE modems". + + +Command Line Parameters +======================= + +The cdc_mbim driver has no parameters of its own. But the probing +behaviour for NCM 1.0 backwards compatible MBIM functions (an +"NCM/MBIM function" as defined in section 3.2 of [1]) is affected +by a cdc_ncm driver parameter: + +prefer_mbim +----------- +:Type: Boolean +:Valid Range: N/Y (0-1) +:Default Value: Y (MBIM is preferred) + +This parameter sets the system policy for NCM/MBIM functions. Such +functions will be handled by either the cdc_ncm driver or the cdc_mbim +driver depending on the prefer_mbim setting. Setting prefer_mbim=N +makes the cdc_mbim driver ignore these functions and lets the cdc_ncm +driver handle them instead. + +The parameter is writable, and can be changed at any time. A manual +unbind/bind is required to make the change effective for NCM/MBIM +functions bound to the "wrong" driver + + +Basic usage +=========== + +MBIM functions are inactive when unmanaged. The cdc_mbim driver only +provides a userspace interface to the MBIM control channel, and will +not participate in the management of the function. This implies that a +userspace MBIM management application always is required to enable a +MBIM function. + +Such userspace applications includes, but are not limited to: + + - mbimcli (included with the libmbim [3] library), and + - ModemManager [4] + +Establishing a MBIM IP session reequires at least these actions by the +management application: + + - open the control channel + - configure network connection settings + - connect to network + - configure IP interface + +Management application development +---------------------------------- +The driver <-> userspace interfaces are described below. The MBIM +control channel protocol is described in [1]. + + +MBIM control channel userspace ABI +================================== + +/dev/cdc-wdmX character device +------------------------------ +The driver creates a two-way pipe to the MBIM function control channel +using the cdc-wdm driver as a subdriver. The userspace end of the +control channel pipe is a /dev/cdc-wdmX character device. + +The cdc_mbim driver does not process or police messages on the control +channel. The channel is fully delegated to the userspace management +application. It is therefore up to this application to ensure that it +complies with all the control channel requirements in [1]. + +The cdc-wdmX device is created as a child of the MBIM control +interface USB device. The character device associated with a specific +MBIM function can be looked up using sysfs. For example:: + + bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc + cdc-wdm0 + + bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev + 180:0 + + +USB configuration descriptors +----------------------------- +The wMaxControlMessage field of the CDC MBIM functional descriptor +limits the maximum control message size. The managament application is +responsible for negotiating a control message size complying with the +requirements in section 9.3.1 of [1], taking this descriptor field +into consideration. + +The userspace application can access the CDC MBIM functional +descriptor of a MBIM function using either of the two USB +configuration descriptor kernel interfaces described in [6] or [7]. + +See also the ioctl documentation below. + + +Fragmentation +------------- +The userspace application is responsible for all control message +fragmentation and defragmentaion, as described in section 9.5 of [1]. + + +/dev/cdc-wdmX write() +--------------------- +The MBIM control messages from the management application *must not* +exceed the negotiated control message size. + + +/dev/cdc-wdmX read() +-------------------- +The management application *must* accept control messages of up the +negotiated control message size. + + +/dev/cdc-wdmX ioctl() +--------------------- +IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size +This ioctl returns the wMaxControlMessage field of the CDC MBIM +functional descriptor for MBIM devices. This is intended as a +convenience, eliminating the need to parse the USB descriptors from +userspace. + +:: + + #include + #include + #include + #include + #include + int main() + { + __u16 max; + int fd = open("/dev/cdc-wdm0", O_RDWR); + if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max)) + printf("wMaxControlMessage is %d\n", max); + } + + +Custom device services +---------------------- +The MBIM specification allows vendors to freely define additional +services. This is fully supported by the cdc_mbim driver. + +Support for new MBIM services, including vendor specified services, is +implemented entirely in userspace, like the rest of the MBIM control +protocol + +New services should be registered in the MBIM Registry [5]. + + + +MBIM data channel userspace ABI +=============================== + +wwanY network device +-------------------- +The cdc_mbim driver represents the MBIM data channel as a single +network device of the "wwan" type. This network device is initially +mapped to MBIM IP session 0. + + +Multiplexed IP sessions (IPS) +----------------------------- +MBIM allows multiplexing up to 256 IP sessions over a single USB data +channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN +subdevices of the master wwanY device, mapping MBIM IP session Z to +VLAN ID Z for all values of Z greater than 0. + +The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure +described in section 10.5.1 of [1]. + +The userspace management application is responsible for adding new +VLAN links prior to establishing MBIM IP sessions where the SessionId +is greater than 0. These links can be added by using the normal VLAN +kernel interfaces, either ioctl or netlink. + +For example, adding a link for a MBIM IP session with SessionId 3:: + + ip link add link wwan0 name wwan0.3 type vlan id 3 + +The driver will automatically map the "wwan0.3" network device to MBIM +IP session 3. + + +Device Service Streams (DSS) +---------------------------- +MBIM also allows up to 256 non-IP data streams to be multiplexed over +the same shared USB data channel. The cdc_mbim driver models these +sessions as another set of 802.1q VLAN subdevices of the master wwanY +device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values +of A. + +The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO +structure described in section 10.5.29 of [1]. + +The DSS VLAN subdevices are used as a practical interface between the +shared MBIM data channel and a MBIM DSS aware userspace application. +It is not intended to be presented as-is to an end user. The +assumption is that a userspace application initiating a DSS session +also takes care of the necessary framing of the DSS data, presenting +the stream to the end user in an appropriate way for the stream type. + +The network device ABI requires a dummy ethernet header for every DSS +data frame being transported. The contents of this header is +arbitrary, with the following exceptions: + + - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped + - RX frames will have the protocol field set to ETH_P_802_3 (but will + not be properly formatted 802.3 frames) + - RX frames will have the destination address set to the hardware + address of the master device + +The DSS supporting userspace management application is responsible for +adding the dummy ethernet header on TX and stripping it on RX. + +This is a simple example using tools commonly available, exporting +DssSessionId 5 as a pty character device pointed to by a /dev/nmea +symlink:: + + ip link add link wwan0 name wwan0.dss5 type vlan id 261 + ip link set dev wwan0.dss5 up + socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea + +This is only an example, most suitable for testing out a DSS +service. Userspace applications supporting specific MBIM DSS services +are expected to use the tools and programming interfaces required by +that service. + +Note that adding VLAN links for DSS sessions is entirely optional. A +management application may instead choose to bind a packet socket +directly to the master network device, using the received VLAN tags to +map frames to the correct DSS session and adding 18 byte VLAN ethernet +headers with the appropriate tag on TX. In this case using a socket +filter is recommended, matching only the DSS VLAN subset. This avoid +unnecessary copying of unrelated IP session data to userspace. For +example:: + + static struct sock_filter dssfilter[] = { + /* use special negative offsets to get VLAN tag */ + BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */ + + /* verify DSS VLAN range */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG), + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */ + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */ + + /* verify ethertype */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1), + + BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */ + BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */ + }; + + + +Tagged IP session 0 VLAN +------------------------ +As described above, MBIM IP session 0 is treated as special by the +driver. It is initially mapped to untagged frames on the wwanY +network device. + +This mapping implies a few restrictions on multiplexed IPS and DSS +sessions, which may not always be practical: + + - no IPS or DSS session can use a frame size greater than the MTU on + IP session 0 + - no IPS or DSS session can be in the up state unless the network + device representing IP session 0 also is up + +These problems can be avoided by optionally making the driver map IP +session 0 to a VLAN subdevice, similar to all other IP sessions. This +behaviour is triggered by adding a VLAN link for the magic VLAN ID +4094. The driver will then immediately start mapping MBIM IP session +0 to this VLAN, and will drop untagged frames on the master wwanY +device. + +Tip: It might be less confusing to the end user to name this VLAN +subdevice after the MBIM SessionID instead of the VLAN ID. For +example:: + + ip link add link wwan0 name wwan0.0 type vlan id 4094 + + +VLAN mapping +------------ + +Summarizing the cdc_mbim driver mapping described above, we have this +relationship between VLAN tags on the wwanY network device and MBIM +sessions on the shared USB data channel:: + + VLAN ID MBIM type MBIM SessionID Notes + --------------------------------------------------------- + untagged IPS 0 a) + 1 - 255 IPS 1 - 255 + 256 - 511 DSS 0 - 255 + 512 - 4093 b) + 4094 IPS 0 c) + + a) if no VLAN ID 4094 link exists, else dropped + b) unsupported VLAN range, unconditionally dropped + c) if a VLAN ID 4094 link exists, else dropped + + + + +References +========== + + 1) USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specification for Mobile Broadband + Interface Model", Revision 1.0 (Errata 1), May 1, 2013 + + - http://www.usb.org/developers/docs/devclass_docs/ + + 2) USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specifications for Network Control + Model Devices", Revision 1.0 (Errata 1), November 24, 2010 + + - http://www.usb.org/developers/docs/devclass_docs/ + + 3) libmbim - "a glib-based library for talking to WWAN modems and + devices which speak the Mobile Interface Broadband Model (MBIM) + protocol" + + - http://www.freedesktop.org/wiki/Software/libmbim/ + + 4) ModemManager - "a DBus-activated daemon which controls mobile + broadband (2G/3G/4G) devices and connections" + + - http://www.freedesktop.org/wiki/Software/ModemManager/ + + 5) "MBIM (Mobile Broadband Interface Model) Registry" + + - http://compliance.usb.org/mbim/ + + 6) "/sys/kernel/debug/usb/devices output format" + + - Documentation/driver-api/usb/usb.rst + + 7) "/sys/bus/usb/devices/.../descriptors" + + - Documentation/ABI/stable/sysfs-bus-usb diff --git a/Documentation/networking/cdc_mbim.txt b/Documentation/networking/cdc_mbim.txt deleted file mode 100644 index 4e68f0bc5dba..000000000000 --- a/Documentation/networking/cdc_mbim.txt +++ /dev/null @@ -1,339 +0,0 @@ - cdc_mbim - Driver for CDC MBIM Mobile Broadband modems - ======================================================== - -The cdc_mbim driver supports USB devices conforming to the "Universal -Serial Bus Communications Class Subclass Specification for Mobile -Broadband Interface Model" [1], which is a further development of -"Universal Serial Bus Communications Class Subclass Specifications for -Network Control Model Devices" [2] optimized for Mobile Broadband -devices, aka "3G/LTE modems". - - -Command Line Parameters -======================= - -The cdc_mbim driver has no parameters of its own. But the probing -behaviour for NCM 1.0 backwards compatible MBIM functions (an -"NCM/MBIM function" as defined in section 3.2 of [1]) is affected -by a cdc_ncm driver parameter: - -prefer_mbim ------------ -Type: Boolean -Valid Range: N/Y (0-1) -Default Value: Y (MBIM is preferred) - -This parameter sets the system policy for NCM/MBIM functions. Such -functions will be handled by either the cdc_ncm driver or the cdc_mbim -driver depending on the prefer_mbim setting. Setting prefer_mbim=N -makes the cdc_mbim driver ignore these functions and lets the cdc_ncm -driver handle them instead. - -The parameter is writable, and can be changed at any time. A manual -unbind/bind is required to make the change effective for NCM/MBIM -functions bound to the "wrong" driver - - -Basic usage -=========== - -MBIM functions are inactive when unmanaged. The cdc_mbim driver only -provides a userspace interface to the MBIM control channel, and will -not participate in the management of the function. This implies that a -userspace MBIM management application always is required to enable a -MBIM function. - -Such userspace applications includes, but are not limited to: - - mbimcli (included with the libmbim [3] library), and - - ModemManager [4] - -Establishing a MBIM IP session reequires at least these actions by the -management application: - - open the control channel - - configure network connection settings - - connect to network - - configure IP interface - -Management application development ----------------------------------- -The driver <-> userspace interfaces are described below. The MBIM -control channel protocol is described in [1]. - - -MBIM control channel userspace ABI -================================== - -/dev/cdc-wdmX character device ------------------------------- -The driver creates a two-way pipe to the MBIM function control channel -using the cdc-wdm driver as a subdriver. The userspace end of the -control channel pipe is a /dev/cdc-wdmX character device. - -The cdc_mbim driver does not process or police messages on the control -channel. The channel is fully delegated to the userspace management -application. It is therefore up to this application to ensure that it -complies with all the control channel requirements in [1]. - -The cdc-wdmX device is created as a child of the MBIM control -interface USB device. The character device associated with a specific -MBIM function can be looked up using sysfs. For example: - - bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc - cdc-wdm0 - - bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev - 180:0 - - -USB configuration descriptors ------------------------------ -The wMaxControlMessage field of the CDC MBIM functional descriptor -limits the maximum control message size. The managament application is -responsible for negotiating a control message size complying with the -requirements in section 9.3.1 of [1], taking this descriptor field -into consideration. - -The userspace application can access the CDC MBIM functional -descriptor of a MBIM function using either of the two USB -configuration descriptor kernel interfaces described in [6] or [7]. - -See also the ioctl documentation below. - - -Fragmentation -------------- -The userspace application is responsible for all control message -fragmentation and defragmentaion, as described in section 9.5 of [1]. - - -/dev/cdc-wdmX write() ---------------------- -The MBIM control messages from the management application *must not* -exceed the negotiated control message size. - - -/dev/cdc-wdmX read() --------------------- -The management application *must* accept control messages of up the -negotiated control message size. - - -/dev/cdc-wdmX ioctl() --------------------- -IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size -This ioctl returns the wMaxControlMessage field of the CDC MBIM -functional descriptor for MBIM devices. This is intended as a -convenience, eliminating the need to parse the USB descriptors from -userspace. - - #include - #include - #include - #include - #include - int main() - { - __u16 max; - int fd = open("/dev/cdc-wdm0", O_RDWR); - if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max)) - printf("wMaxControlMessage is %d\n", max); - } - - -Custom device services ----------------------- -The MBIM specification allows vendors to freely define additional -services. This is fully supported by the cdc_mbim driver. - -Support for new MBIM services, including vendor specified services, is -implemented entirely in userspace, like the rest of the MBIM control -protocol - -New services should be registered in the MBIM Registry [5]. - - - -MBIM data channel userspace ABI -=============================== - -wwanY network device --------------------- -The cdc_mbim driver represents the MBIM data channel as a single -network device of the "wwan" type. This network device is initially -mapped to MBIM IP session 0. - - -Multiplexed IP sessions (IPS) ------------------------------ -MBIM allows multiplexing up to 256 IP sessions over a single USB data -channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN -subdevices of the master wwanY device, mapping MBIM IP session Z to -VLAN ID Z for all values of Z greater than 0. - -The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure -described in section 10.5.1 of [1]. - -The userspace management application is responsible for adding new -VLAN links prior to establishing MBIM IP sessions where the SessionId -is greater than 0. These links can be added by using the normal VLAN -kernel interfaces, either ioctl or netlink. - -For example, adding a link for a MBIM IP session with SessionId 3: - - ip link add link wwan0 name wwan0.3 type vlan id 3 - -The driver will automatically map the "wwan0.3" network device to MBIM -IP session 3. - - -Device Service Streams (DSS) ----------------------------- -MBIM also allows up to 256 non-IP data streams to be multiplexed over -the same shared USB data channel. The cdc_mbim driver models these -sessions as another set of 802.1q VLAN subdevices of the master wwanY -device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values -of A. - -The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO -structure described in section 10.5.29 of [1]. - -The DSS VLAN subdevices are used as a practical interface between the -shared MBIM data channel and a MBIM DSS aware userspace application. -It is not intended to be presented as-is to an end user. The -assumption is that a userspace application initiating a DSS session -also takes care of the necessary framing of the DSS data, presenting -the stream to the end user in an appropriate way for the stream type. - -The network device ABI requires a dummy ethernet header for every DSS -data frame being transported. The contents of this header is -arbitrary, with the following exceptions: - - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped - - RX frames will have the protocol field set to ETH_P_802_3 (but will - not be properly formatted 802.3 frames) - - RX frames will have the destination address set to the hardware - address of the master device - -The DSS supporting userspace management application is responsible for -adding the dummy ethernet header on TX and stripping it on RX. - -This is a simple example using tools commonly available, exporting -DssSessionId 5 as a pty character device pointed to by a /dev/nmea -symlink: - - ip link add link wwan0 name wwan0.dss5 type vlan id 261 - ip link set dev wwan0.dss5 up - socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea - -This is only an example, most suitable for testing out a DSS -service. Userspace applications supporting specific MBIM DSS services -are expected to use the tools and programming interfaces required by -that service. - -Note that adding VLAN links for DSS sessions is entirely optional. A -management application may instead choose to bind a packet socket -directly to the master network device, using the received VLAN tags to -map frames to the correct DSS session and adding 18 byte VLAN ethernet -headers with the appropriate tag on TX. In this case using a socket -filter is recommended, matching only the DSS VLAN subset. This avoid -unnecessary copying of unrelated IP session data to userspace. For -example: - - static struct sock_filter dssfilter[] = { - /* use special negative offsets to get VLAN tag */ - BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT), - BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */ - - /* verify DSS VLAN range */ - BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG), - BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */ - BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */ - - /* verify ethertype */ - BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN), - BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1), - - BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */ - BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */ - }; - - - -Tagged IP session 0 VLAN ------------------------- -As described above, MBIM IP session 0 is treated as special by the -driver. It is initially mapped to untagged frames on the wwanY -network device. - -This mapping implies a few restrictions on multiplexed IPS and DSS -sessions, which may not always be practical: - - no IPS or DSS session can use a frame size greater than the MTU on - IP session 0 - - no IPS or DSS session can be in the up state unless the network - device representing IP session 0 also is up - -These problems can be avoided by optionally making the driver map IP -session 0 to a VLAN subdevice, similar to all other IP sessions. This -behaviour is triggered by adding a VLAN link for the magic VLAN ID -4094. The driver will then immediately start mapping MBIM IP session -0 to this VLAN, and will drop untagged frames on the master wwanY -device. - -Tip: It might be less confusing to the end user to name this VLAN -subdevice after the MBIM SessionID instead of the VLAN ID. For -example: - - ip link add link wwan0 name wwan0.0 type vlan id 4094 - - -VLAN mapping ------------- - -Summarizing the cdc_mbim driver mapping described above, we have this -relationship between VLAN tags on the wwanY network device and MBIM -sessions on the shared USB data channel: - - VLAN ID MBIM type MBIM SessionID Notes - --------------------------------------------------------- - untagged IPS 0 a) - 1 - 255 IPS 1 - 255 - 256 - 511 DSS 0 - 255 - 512 - 4093 b) - 4094 IPS 0 c) - - a) if no VLAN ID 4094 link exists, else dropped - b) unsupported VLAN range, unconditionally dropped - c) if a VLAN ID 4094 link exists, else dropped - - - - -References -========== - -[1] USB Implementers Forum, Inc. - "Universal Serial Bus - Communications Class Subclass Specification for Mobile Broadband - Interface Model", Revision 1.0 (Errata 1), May 1, 2013 - - http://www.usb.org/developers/docs/devclass_docs/ - -[2] USB Implementers Forum, Inc. - "Universal Serial Bus - Communications Class Subclass Specifications for Network Control - Model Devices", Revision 1.0 (Errata 1), November 24, 2010 - - http://www.usb.org/developers/docs/devclass_docs/ - -[3] libmbim - "a glib-based library for talking to WWAN modems and - devices which speak the Mobile Interface Broadband Model (MBIM) - protocol" - - http://www.freedesktop.org/wiki/Software/libmbim/ - -[4] ModemManager - "a DBus-activated daemon which controls mobile - broadband (2G/3G/4G) devices and connections" - - http://www.freedesktop.org/wiki/Software/ModemManager/ - -[5] "MBIM (Mobile Broadband Interface Model) Registry" - - http://compliance.usb.org/mbim/ - -[6] "/sys/kernel/debug/usb/devices output format" - - Documentation/driver-api/usb/usb.rst - -[7] "/sys/bus/usb/devices/.../descriptors" - - Documentation/ABI/stable/sysfs-bus-usb diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 22b872834ef0..55802abd65a0 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -45,6 +45,7 @@ Contents: ax25 baycom bonding + cdc_mbim .. only:: subproject and html -- cgit v1.2.3 From 99b0e82dc5e36edb625f519121d4398628f05e95 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:26 +0200 Subject: docs: networking: convert cops.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/cops.rst | 80 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/cops.txt | 63 ------------------------------ Documentation/networking/index.rst | 1 + drivers/net/appletalk/Kconfig | 2 +- 4 files changed, 82 insertions(+), 64 deletions(-) create mode 100644 Documentation/networking/cops.rst delete mode 100644 Documentation/networking/cops.txt (limited to 'Documentation') diff --git a/Documentation/networking/cops.rst b/Documentation/networking/cops.rst new file mode 100644 index 000000000000..964ba80599a9 --- /dev/null +++ b/Documentation/networking/cops.rst @@ -0,0 +1,80 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================== +The COPS LocalTalk Linux driver (cops.c) +======================================== + +By Jay Schulist + +This driver has two modes and they are: Dayna mode and Tangent mode. +Each mode corresponds with the type of card. It has been found +that there are 2 main types of cards and all other cards are +the same and just have different names or only have minor differences +such as more IO ports. As this driver is tested it will +become more clear exactly what cards are supported. + +Right now these cards are known to work with the COPS driver. The +LT-200 cards work in a somewhat more limited capacity than the +DL200 cards, which work very well and are in use by many people. + +TANGENT driver mode: + - Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200 + +DAYNA driver mode: + - Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95, + - Farallon PhoneNET PC III, Farallon PhoneNET PC II + +Other cards possibly supported mode unknown though: + - Dayna DL2000 (Full length) + +The COPS driver defaults to using Dayna mode. To change the driver's +mode if you built a driver with dual support use board_type=1 or +board_type=2 for Dayna or Tangent with insmod. + +Operation/loading of the driver +=============================== + +Use modprobe like this: /sbin/modprobe cops.o (IO #) (IRQ #) +If you do not specify any options the driver will try and use the IO = 0x240, +IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing. + +To load multiple COPS driver Localtalk cards you can do one of the following:: + + insmod cops io=0x240 irq=5 + insmod -o cops2 cops io=0x260 irq=3 + +Or in lilo.conf put something like this:: + + append="ether=5,0x240,lt0 ether=3,0x260,lt1" + +Then bring up the interface with ifconfig. It will look something like this:: + + lt0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00 + inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0 + UP BROADCAST RUNNING NOARP MULTICAST MTU:600 Metric:1 + RX packets:0 errors:0 dropped:0 overruns:0 frame:0 + TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0 + +Netatalk Configuration +====================== + +You will need to configure atalkd with something like the following to make +it work with the cops.c driver. + +* For single LTalk card use:: + + dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033" + lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033" + +* For multiple cards, Ethernet and LocalTalk:: + + eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033" + lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033" + +* For multiple LocalTalk cards, and an Ethernet card. + +* Order seems to matter here, Ethernet last:: + + lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1" + lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2" + eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk" diff --git a/Documentation/networking/cops.txt b/Documentation/networking/cops.txt deleted file mode 100644 index 3e344b448e07..000000000000 --- a/Documentation/networking/cops.txt +++ /dev/null @@ -1,63 +0,0 @@ -Text File for the COPS LocalTalk Linux driver (cops.c). - By Jay Schulist - -This driver has two modes and they are: Dayna mode and Tangent mode. -Each mode corresponds with the type of card. It has been found -that there are 2 main types of cards and all other cards are -the same and just have different names or only have minor differences -such as more IO ports. As this driver is tested it will -become more clear exactly what cards are supported. - -Right now these cards are known to work with the COPS driver. The -LT-200 cards work in a somewhat more limited capacity than the -DL200 cards, which work very well and are in use by many people. - -TANGENT driver mode: - Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200 -DAYNA driver mode: - Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95, - Farallon PhoneNET PC III, Farallon PhoneNET PC II -Other cards possibly supported mode unknown though: - Dayna DL2000 (Full length) - -The COPS driver defaults to using Dayna mode. To change the driver's -mode if you built a driver with dual support use board_type=1 or -board_type=2 for Dayna or Tangent with insmod. - -** Operation/loading of the driver. -Use modprobe like this: /sbin/modprobe cops.o (IO #) (IRQ #) -If you do not specify any options the driver will try and use the IO = 0x240, -IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing. - -To load multiple COPS driver Localtalk cards you can do one of the following. - -insmod cops io=0x240 irq=5 -insmod -o cops2 cops io=0x260 irq=3 - -Or in lilo.conf put something like this: - append="ether=5,0x240,lt0 ether=3,0x260,lt1" - -Then bring up the interface with ifconfig. It will look something like this: -lt0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00 - inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0 - UP BROADCAST RUNNING NOARP MULTICAST MTU:600 Metric:1 - RX packets:0 errors:0 dropped:0 overruns:0 frame:0 - TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0 - -** Netatalk Configuration -You will need to configure atalkd with something like the following to make -it work with the cops.c driver. - -* For single LTalk card use. -dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033" -lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033" - -* For multiple cards, Ethernet and LocalTalk. -eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033" -lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033" - -* For multiple LocalTalk cards, and an Ethernet card. -* Order seems to matter here, Ethernet last. -lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1" -lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2" -eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk" diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 55802abd65a0..7b596810d479 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -46,6 +46,7 @@ Contents: baycom bonding cdc_mbim + cops .. only:: subproject and html diff --git a/drivers/net/appletalk/Kconfig b/drivers/net/appletalk/Kconfig index af509b05ac5c..d4e51c048f62 100644 --- a/drivers/net/appletalk/Kconfig +++ b/drivers/net/appletalk/Kconfig @@ -59,7 +59,7 @@ config COPS package. This driver is experimental, which means that it may not work. This driver will only work if you choose "AppleTalk DDP" networking support, above. - Please read the file . + Please read the file . config COPS_DAYNA bool "Dayna firmware support" -- cgit v1.2.3 From 9a9891fbdf935c270388fca856c117ad71c02458 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:27 +0200 Subject: docs: networking: convert cxacru.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/cxacru.rst | 120 ++++++++++++++++++++++++++++++++++++ Documentation/networking/cxacru.txt | 100 ------------------------------ Documentation/networking/index.rst | 1 + 3 files changed, 121 insertions(+), 100 deletions(-) create mode 100644 Documentation/networking/cxacru.rst delete mode 100644 Documentation/networking/cxacru.txt (limited to 'Documentation') diff --git a/Documentation/networking/cxacru.rst b/Documentation/networking/cxacru.rst new file mode 100644 index 000000000000..6088af2ffeda --- /dev/null +++ b/Documentation/networking/cxacru.rst @@ -0,0 +1,120 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== +ATM cxacru device driver +======================== + +Firmware is required for this device: http://accessrunner.sourceforge.net/ + +While it is capable of managing/maintaining the ADSL connection without the +module loaded, the device will sometimes stop responding after unloading the +driver and it is necessary to unplug/remove power to the device to fix this. + +Note: support for cxacru-cf.bin has been removed. It was not loaded correctly +so it had no effect on the device configuration. Fixing it could have stopped +existing devices working when an invalid configuration is supplied. + +There is a script cxacru-cf.py to convert an existing file to the sysfs form. + +Detected devices will appear as ATM devices named "cxacru". In /sys/class/atm/ +these are directories named cxacruN where N is the device number. A symlink +named device points to the USB interface device's directory which contains +several sysfs attribute files for retrieving device statistics: + +* adsl_controller_version + +* adsl_headend +* adsl_headend_environment + + - Information about the remote headend. + +* adsl_config + + - Configuration writing interface. + - Write parameters in hexadecimal format =, + separated by whitespace, e.g.: + + "1=0 a=5" + + - Up to 7 parameters at a time will be sent and the modem will restart + the ADSL connection when any value is set. These are logged for future + reference. + +* downstream_attenuation (dB) +* downstream_bits_per_frame +* downstream_rate (kbps) +* downstream_snr_margin (dB) + + - Downstream stats. + +* upstream_attenuation (dB) +* upstream_bits_per_frame +* upstream_rate (kbps) +* upstream_snr_margin (dB) +* transmitter_power (dBm/Hz) + + - Upstream stats. + +* downstream_crc_errors +* downstream_fec_errors +* downstream_hec_errors +* upstream_crc_errors +* upstream_fec_errors +* upstream_hec_errors + + - Error counts. + +* line_startable + + - Indicates that ADSL support on the device + is/can be enabled, see adsl_start. + +* line_status + + - "initialising" + - "down" + - "attempting to activate" + - "training" + - "channel analysis" + - "exchange" + - "waiting" + - "up" + + Changes between "down" and "attempting to activate" + if there is no signal. + +* link_status + + - "not connected" + - "connected" + - "lost" + +* mac_address + +* modulation + + - "" (when not connected) + - "ANSI T1.413" + - "ITU-T G.992.1 (G.DMT)" + - "ITU-T G.992.2 (G.LITE)" + +* startup_attempts + + - Count of total attempts to initialise ADSL. + +To enable/disable ADSL, the following can be written to the adsl_state file: + + - "start" + - "stop + - "restart" (stops, waits 1.5s, then starts) + - "poll" (used to resume status polling if it was disabled due to failure) + +Changes in adsl/line state are reported via kernel log messages:: + + [4942145.150704] ATM dev 0: ADSL state: running + [4942243.663766] ATM dev 0: ADSL line: down + [4942249.665075] ATM dev 0: ADSL line: attempting to activate + [4942253.654954] ATM dev 0: ADSL line: training + [4942255.666387] ATM dev 0: ADSL line: channel analysis + [4942259.656262] ATM dev 0: ADSL line: exchange + [2635357.696901] ATM dev 0: ADSL line: up (8128 kb/s down | 832 kb/s up) diff --git a/Documentation/networking/cxacru.txt b/Documentation/networking/cxacru.txt deleted file mode 100644 index 2cce04457b4d..000000000000 --- a/Documentation/networking/cxacru.txt +++ /dev/null @@ -1,100 +0,0 @@ -Firmware is required for this device: http://accessrunner.sourceforge.net/ - -While it is capable of managing/maintaining the ADSL connection without the -module loaded, the device will sometimes stop responding after unloading the -driver and it is necessary to unplug/remove power to the device to fix this. - -Note: support for cxacru-cf.bin has been removed. It was not loaded correctly -so it had no effect on the device configuration. Fixing it could have stopped -existing devices working when an invalid configuration is supplied. - -There is a script cxacru-cf.py to convert an existing file to the sysfs form. - -Detected devices will appear as ATM devices named "cxacru". In /sys/class/atm/ -these are directories named cxacruN where N is the device number. A symlink -named device points to the USB interface device's directory which contains -several sysfs attribute files for retrieving device statistics: - -* adsl_controller_version - -* adsl_headend -* adsl_headend_environment - Information about the remote headend. - -* adsl_config - Configuration writing interface. - Write parameters in hexadecimal format =, - separated by whitespace, e.g.: - "1=0 a=5" - Up to 7 parameters at a time will be sent and the modem will restart - the ADSL connection when any value is set. These are logged for future - reference. - -* downstream_attenuation (dB) -* downstream_bits_per_frame -* downstream_rate (kbps) -* downstream_snr_margin (dB) - Downstream stats. - -* upstream_attenuation (dB) -* upstream_bits_per_frame -* upstream_rate (kbps) -* upstream_snr_margin (dB) -* transmitter_power (dBm/Hz) - Upstream stats. - -* downstream_crc_errors -* downstream_fec_errors -* downstream_hec_errors -* upstream_crc_errors -* upstream_fec_errors -* upstream_hec_errors - Error counts. - -* line_startable - Indicates that ADSL support on the device - is/can be enabled, see adsl_start. - -* line_status - "initialising" - "down" - "attempting to activate" - "training" - "channel analysis" - "exchange" - "waiting" - "up" - - Changes between "down" and "attempting to activate" - if there is no signal. - -* link_status - "not connected" - "connected" - "lost" - -* mac_address - -* modulation - "" (when not connected) - "ANSI T1.413" - "ITU-T G.992.1 (G.DMT)" - "ITU-T G.992.2 (G.LITE)" - -* startup_attempts - Count of total attempts to initialise ADSL. - -To enable/disable ADSL, the following can be written to the adsl_state file: - "start" - "stop - "restart" (stops, waits 1.5s, then starts) - "poll" (used to resume status polling if it was disabled due to failure) - -Changes in adsl/line state are reported via kernel log messages: - [4942145.150704] ATM dev 0: ADSL state: running - [4942243.663766] ATM dev 0: ADSL line: down - [4942249.665075] ATM dev 0: ADSL line: attempting to activate - [4942253.654954] ATM dev 0: ADSL line: training - [4942255.666387] ATM dev 0: ADSL line: channel analysis - [4942259.656262] ATM dev 0: ADSL line: exchange - [2635357.696901] ATM dev 0: ADSL line: up (8128 kb/s down | 832 kb/s up) diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 7b596810d479..4c8e896490e0 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -47,6 +47,7 @@ Contents: bonding cdc_mbim cops + cxacru .. only:: subproject and html -- cgit v1.2.3 From 33155bac6519f545137d9c46d2e59e5f8332dd50 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:28 +0200 Subject: docs: networking: convert dccp.txt to ReST - add SPDX header; - adjust title markup; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/dccp.rst | 216 +++++++++++++++++++++++++++++++++++++ Documentation/networking/dccp.txt | 207 ----------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 217 insertions(+), 207 deletions(-) create mode 100644 Documentation/networking/dccp.rst delete mode 100644 Documentation/networking/dccp.txt (limited to 'Documentation') diff --git a/Documentation/networking/dccp.rst b/Documentation/networking/dccp.rst new file mode 100644 index 000000000000..dde16be04456 --- /dev/null +++ b/Documentation/networking/dccp.rst @@ -0,0 +1,216 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +DCCP protocol +============= + + +.. Contents + - Introduction + - Missing features + - Socket options + - Sysctl variables + - IOCTLs + - Other tunables + - Notes + + +Introduction +============ +Datagram Congestion Control Protocol (DCCP) is an unreliable, connection +oriented protocol designed to solve issues present in UDP and TCP, particularly +for real-time and multimedia (streaming) traffic. +It divides into a base protocol (RFC 4340) and pluggable congestion control +modules called CCIDs. Like pluggable TCP congestion control, at least one CCID +needs to be enabled in order for the protocol to function properly. In the Linux +implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as +the TCP-friendly CCID3 (RFC 4342), are optional. +For a brief introduction to CCIDs and suggestions for choosing a CCID to match +given applications, see section 10 of RFC 4340. + +It has a base protocol and pluggable congestion control IDs (CCIDs). + +DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol +is at http://www.ietf.org/html.charters/dccp-charter.html + + +Missing features +================ +The Linux DCCP implementation does not currently support all the features that are +specified in RFCs 4340...42. + +The known bugs are at: + + http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP + +For more up-to-date versions of the DCCP implementation, please consider using +the experimental DCCP test tree; instructions for checking this out are on: +http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree + + +Socket options +============== +DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes +a policy ID as argument and can only be set before the connection (i.e. changes +during an established connection are not supported). Currently, two policies are +defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special, +and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an +u32 priority value as ancillary data to sendmsg(), where higher numbers indicate +a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to +be formatted using a cmsg(3) message header filled in as follows:: + + cmsg->cmsg_level = SOL_DCCP; + cmsg->cmsg_type = DCCP_SCM_PRIORITY; + cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */ + +DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero +value is always interpreted as unbounded queue length. If different from zero, +the interpretation of this parameter depends on the current dequeuing policy +(see above): the "simple" policy will enforce a fixed queue size by returning +EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the +lowest-priority packet first. The default value for this parameter is +initialised from /proc/sys/net/dccp/default/tx_qlen. + +DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of +service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, +the socket will fall back to 0 (which means that no meaningful service code +is present). On active sockets this is set before connect(); specifying more +than one code has no effect (all subsequent service codes are ignored). The +case is different for passive sockets, where multiple service codes (up to 32) +can be set before calling bind(). + +DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet +size (application payload size) in bytes, see RFC 4340, section 14. + +DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs +supported by the endpoint. The option value is an array of type uint8_t whose +size is passed as option length. The minimum array size is 4 elements, the +value returned in the optlen argument always reflects the true number of +built-in CCIDs. + +DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same +time, combining the operation of the next two socket options. This option is +preferable over the latter two, since often applications will use the same +type of CCID for both directions; and mixed use of CCIDs is not currently well +understood. This socket option takes as argument at least one uint8_t value, or +an array of uint8_t values, which must match available CCIDS (see above). CCIDs +must be registered on the socket before calling connect() or listen(). + +DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets +the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID. +Please note that the getsockopt argument type here is ``int``, not uint8_t. + +DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID. + +DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold +timewait state when closing the connection (RFC 4340, 8.3). The usual case is +that the closing server sends a CloseReq, whereupon the client holds timewait +state. When this boolean socket option is on, the server sends a Close instead +and will enter TIMEWAIT. This option must be set after accept() returns. + +DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the +partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums +always cover the entire packet and that only fully covered application data is +accepted by the receiver. Hence, when using this feature on the sender, it must +be enabled at the receiver, too with suitable choice of CsCov. + +DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the + range 0..15 are acceptable. The default setting is 0 (full coverage), + values between 1..15 indicate partial coverage. + +DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it + sets a threshold, where again values 0..15 are acceptable. The default + of 0 means that all packets with a partial coverage will be discarded. + Values in the range 1..15 indicate that packets with minimally such a + coverage value are also acceptable. The higher the number, the more + restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage + settings are inherited to the child socket after accept(). + +The following two options apply to CCID 3 exclusively and are getsockopt()-only. +In either case, a TFRC info struct (defined in ) is returned. + +DCCP_SOCKOPT_CCID_RX_INFO + Returns a ``struct tfrc_rx_info`` in optval; the buffer for optval and + optlen must be set to at least sizeof(struct tfrc_rx_info). + +DCCP_SOCKOPT_CCID_TX_INFO + Returns a ``struct tfrc_tx_info`` in optval; the buffer for optval and + optlen must be set to at least sizeof(struct tfrc_tx_info). + +On unidirectional connections it is useful to close the unused half-connection +via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs. + + +Sysctl variables +================ +Several DCCP default parameters can be managed by the following sysctls +(sysctl net.dccp.default or /proc/sys/net/dccp/default): + +request_retries + The number of active connection initiation retries (the number of + Requests minus one) before timing out. In addition, it also governs + the behaviour of the other, passive side: this variable also sets + the number of times DCCP repeats sending a Response when the initial + handshake does not progress from RESPOND to OPEN (i.e. when no Ack + is received after the initial Request). This value should be greater + than 0, suggested is less than 10. Analogue of tcp_syn_retries. + +retries1 + How often a DCCP Response is retransmitted until the listening DCCP + side considers its connecting peer dead. Analogue of tcp_retries1. + +retries2 + The number of times a general DCCP packet is retransmitted. This has + importance for retransmitted acknowledgments and feature negotiation, + data packets are never retransmitted. Analogue of tcp_retries2. + +tx_ccid = 2 + Default CCID for the sender-receiver half-connection. Depending on the + choice of CCID, the Send Ack Vector feature is enabled automatically. + +rx_ccid = 2 + Default CCID for the receiver-sender half-connection; see tx_ccid. + +seq_window = 100 + The initial sequence window (sec. 7.5.2) of the sender. This influences + the local ackno validity and the remote seqno validity windows (7.5.1). + Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set. + +tx_qlen = 5 + The size of the transmit buffer in packets. A value of 0 corresponds + to an unbounded transmit buffer. + +sync_ratelimit = 125 ms + The timeout between subsequent DCCP-Sync packets sent in response to + sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit + of this parameter is milliseconds; a value of 0 disables rate-limiting. + + +IOCTLS +====== +FIONREAD + Works as in udp(7): returns in the ``int`` argument pointer the size of + the next pending datagram in bytes, or 0 when no datagram is pending. + + +Other tunables +============== +Per-route rto_min support + CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value + of the RTO timer. This setting can be modified via the 'rto_min' option + of iproute2; for example:: + + > ip route change 10.0.0.0/24 rto_min 250j dev wlan0 + > ip route add 10.0.0.254/32 rto_min 800j dev wlan0 + > ip route show dev wlan0 + + CCID-3 also supports the rto_min setting: it is used to define the lower + bound for the expiry of the nofeedback timer. This can be useful on LANs + with very low RTTs (e.g., loopback, Gbit ethernet). + + +Notes +===== +DCCP does not travel through NAT successfully at present on many boxes. This is +because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT +support for DCCP has been added. diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt deleted file mode 100644 index 55c575fcaf17..000000000000 --- a/Documentation/networking/dccp.txt +++ /dev/null @@ -1,207 +0,0 @@ -DCCP protocol -============= - - -Contents -======== -- Introduction -- Missing features -- Socket options -- Sysctl variables -- IOCTLs -- Other tunables -- Notes - - -Introduction -============ -Datagram Congestion Control Protocol (DCCP) is an unreliable, connection -oriented protocol designed to solve issues present in UDP and TCP, particularly -for real-time and multimedia (streaming) traffic. -It divides into a base protocol (RFC 4340) and pluggable congestion control -modules called CCIDs. Like pluggable TCP congestion control, at least one CCID -needs to be enabled in order for the protocol to function properly. In the Linux -implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as -the TCP-friendly CCID3 (RFC 4342), are optional. -For a brief introduction to CCIDs and suggestions for choosing a CCID to match -given applications, see section 10 of RFC 4340. - -It has a base protocol and pluggable congestion control IDs (CCIDs). - -DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol -is at http://www.ietf.org/html.charters/dccp-charter.html - - -Missing features -================ -The Linux DCCP implementation does not currently support all the features that are -specified in RFCs 4340...42. - -The known bugs are at: - http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP - -For more up-to-date versions of the DCCP implementation, please consider using -the experimental DCCP test tree; instructions for checking this out are on: -http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree - - -Socket options -============== -DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes -a policy ID as argument and can only be set before the connection (i.e. changes -during an established connection are not supported). Currently, two policies are -defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special, -and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an -u32 priority value as ancillary data to sendmsg(), where higher numbers indicate -a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to -be formatted using a cmsg(3) message header filled in as follows: - cmsg->cmsg_level = SOL_DCCP; - cmsg->cmsg_type = DCCP_SCM_PRIORITY; - cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */ - -DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero -value is always interpreted as unbounded queue length. If different from zero, -the interpretation of this parameter depends on the current dequeuing policy -(see above): the "simple" policy will enforce a fixed queue size by returning -EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the -lowest-priority packet first. The default value for this parameter is -initialised from /proc/sys/net/dccp/default/tx_qlen. - -DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of -service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, -the socket will fall back to 0 (which means that no meaningful service code -is present). On active sockets this is set before connect(); specifying more -than one code has no effect (all subsequent service codes are ignored). The -case is different for passive sockets, where multiple service codes (up to 32) -can be set before calling bind(). - -DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet -size (application payload size) in bytes, see RFC 4340, section 14. - -DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs -supported by the endpoint. The option value is an array of type uint8_t whose -size is passed as option length. The minimum array size is 4 elements, the -value returned in the optlen argument always reflects the true number of -built-in CCIDs. - -DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same -time, combining the operation of the next two socket options. This option is -preferable over the latter two, since often applications will use the same -type of CCID for both directions; and mixed use of CCIDs is not currently well -understood. This socket option takes as argument at least one uint8_t value, or -an array of uint8_t values, which must match available CCIDS (see above). CCIDs -must be registered on the socket before calling connect() or listen(). - -DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets -the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID. -Please note that the getsockopt argument type here is `int', not uint8_t. - -DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID. - -DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold -timewait state when closing the connection (RFC 4340, 8.3). The usual case is -that the closing server sends a CloseReq, whereupon the client holds timewait -state. When this boolean socket option is on, the server sends a Close instead -and will enter TIMEWAIT. This option must be set after accept() returns. - -DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the -partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums -always cover the entire packet and that only fully covered application data is -accepted by the receiver. Hence, when using this feature on the sender, it must -be enabled at the receiver, too with suitable choice of CsCov. - -DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the - range 0..15 are acceptable. The default setting is 0 (full coverage), - values between 1..15 indicate partial coverage. -DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it - sets a threshold, where again values 0..15 are acceptable. The default - of 0 means that all packets with a partial coverage will be discarded. - Values in the range 1..15 indicate that packets with minimally such a - coverage value are also acceptable. The higher the number, the more - restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage - settings are inherited to the child socket after accept(). - -The following two options apply to CCID 3 exclusively and are getsockopt()-only. -In either case, a TFRC info struct (defined in ) is returned. -DCCP_SOCKOPT_CCID_RX_INFO - Returns a `struct tfrc_rx_info' in optval; the buffer for optval and - optlen must be set to at least sizeof(struct tfrc_rx_info). -DCCP_SOCKOPT_CCID_TX_INFO - Returns a `struct tfrc_tx_info' in optval; the buffer for optval and - optlen must be set to at least sizeof(struct tfrc_tx_info). - -On unidirectional connections it is useful to close the unused half-connection -via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs. - - -Sysctl variables -================ -Several DCCP default parameters can be managed by the following sysctls -(sysctl net.dccp.default or /proc/sys/net/dccp/default): - -request_retries - The number of active connection initiation retries (the number of - Requests minus one) before timing out. In addition, it also governs - the behaviour of the other, passive side: this variable also sets - the number of times DCCP repeats sending a Response when the initial - handshake does not progress from RESPOND to OPEN (i.e. when no Ack - is received after the initial Request). This value should be greater - than 0, suggested is less than 10. Analogue of tcp_syn_retries. - -retries1 - How often a DCCP Response is retransmitted until the listening DCCP - side considers its connecting peer dead. Analogue of tcp_retries1. - -retries2 - The number of times a general DCCP packet is retransmitted. This has - importance for retransmitted acknowledgments and feature negotiation, - data packets are never retransmitted. Analogue of tcp_retries2. - -tx_ccid = 2 - Default CCID for the sender-receiver half-connection. Depending on the - choice of CCID, the Send Ack Vector feature is enabled automatically. - -rx_ccid = 2 - Default CCID for the receiver-sender half-connection; see tx_ccid. - -seq_window = 100 - The initial sequence window (sec. 7.5.2) of the sender. This influences - the local ackno validity and the remote seqno validity windows (7.5.1). - Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set. - -tx_qlen = 5 - The size of the transmit buffer in packets. A value of 0 corresponds - to an unbounded transmit buffer. - -sync_ratelimit = 125 ms - The timeout between subsequent DCCP-Sync packets sent in response to - sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit - of this parameter is milliseconds; a value of 0 disables rate-limiting. - - -IOCTLS -====== -FIONREAD - Works as in udp(7): returns in the `int' argument pointer the size of - the next pending datagram in bytes, or 0 when no datagram is pending. - - -Other tunables -============== -Per-route rto_min support - CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value - of the RTO timer. This setting can be modified via the 'rto_min' option - of iproute2; for example: - > ip route change 10.0.0.0/24 rto_min 250j dev wlan0 - > ip route add 10.0.0.254/32 rto_min 800j dev wlan0 - > ip route show dev wlan0 - CCID-3 also supports the rto_min setting: it is used to define the lower - bound for the expiry of the nofeedback timer. This can be useful on LANs - with very low RTTs (e.g., loopback, Gbit ethernet). - - -Notes -===== -DCCP does not travel through NAT successfully at present on many boxes. This is -because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT -support for DCCP has been added. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 4c8e896490e0..3894043332de 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -48,6 +48,7 @@ Contents: cdc_mbim cops cxacru + dccp .. only:: subproject and html -- cgit v1.2.3 From 8447bb44ef7c452cbc94c04fc38d4946a3ef9165 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:29 +0200 Subject: docs: networking: convert dctcp.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/dctcp.rst | 52 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/dctcp.txt | 44 -------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 53 insertions(+), 44 deletions(-) create mode 100644 Documentation/networking/dctcp.rst delete mode 100644 Documentation/networking/dctcp.txt (limited to 'Documentation') diff --git a/Documentation/networking/dctcp.rst b/Documentation/networking/dctcp.rst new file mode 100644 index 000000000000..4cc8bb2dad50 --- /dev/null +++ b/Documentation/networking/dctcp.rst @@ -0,0 +1,52 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +DCTCP (DataCenter TCP) +====================== + +DCTCP is an enhancement to the TCP congestion control algorithm for data +center networks and leverages Explicit Congestion Notification (ECN) in +the data center network to provide multi-bit feedback to the end hosts. + +To enable it on end hosts:: + + sysctl -w net.ipv4.tcp_congestion_control=dctcp + sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional) + +All switches in the data center network running DCTCP must support ECN +marking and be configured for marking when reaching defined switch buffer +thresholds. The default ECN marking threshold heuristic for DCTCP on +switches is 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps, +but might need further careful tweaking. + +For more details, see below documents: + +Paper: + +The algorithm is further described in detail in the following two +SIGCOMM/SIGMETRICS papers: + + i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, + Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan: + + "Data Center TCP (DCTCP)", Data Center Networks session" + + Proc. ACM SIGCOMM, New Delhi, 2010. + + http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf + http://www.sigcomm.org/ccr/papers/2010/October/1851275.1851192 + +ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar: + + "Analysis of DCTCP: Stability, Convergence, and Fairness" + Proc. ACM SIGMETRICS, San Jose, 2011. + + http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf + +IETF informational draft: + + http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 + +DCTCP site: + + http://simula.stanford.edu/~alizade/Site/DCTCP.html diff --git a/Documentation/networking/dctcp.txt b/Documentation/networking/dctcp.txt deleted file mode 100644 index 13a857753208..000000000000 --- a/Documentation/networking/dctcp.txt +++ /dev/null @@ -1,44 +0,0 @@ -DCTCP (DataCenter TCP) ----------------------- - -DCTCP is an enhancement to the TCP congestion control algorithm for data -center networks and leverages Explicit Congestion Notification (ECN) in -the data center network to provide multi-bit feedback to the end hosts. - -To enable it on end hosts: - - sysctl -w net.ipv4.tcp_congestion_control=dctcp - sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional) - -All switches in the data center network running DCTCP must support ECN -marking and be configured for marking when reaching defined switch buffer -thresholds. The default ECN marking threshold heuristic for DCTCP on -switches is 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps, -but might need further careful tweaking. - -For more details, see below documents: - -Paper: - -The algorithm is further described in detail in the following two -SIGCOMM/SIGMETRICS papers: - - i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, - Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan: - "Data Center TCP (DCTCP)", Data Center Networks session - Proc. ACM SIGCOMM, New Delhi, 2010. - http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf - http://www.sigcomm.org/ccr/papers/2010/October/1851275.1851192 - -ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar: - "Analysis of DCTCP: Stability, Convergence, and Fairness" - Proc. ACM SIGMETRICS, San Jose, 2011. - http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf - -IETF informational draft: - - http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 - -DCTCP site: - - http://simula.stanford.edu/~alizade/Site/DCTCP.html diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 3894043332de..9e83d3bda4e0 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -49,6 +49,7 @@ Contents: cops cxacru dccp + dctcp .. only:: subproject and html -- cgit v1.2.3 From 9a69fb9c21c4bf4107becb877729544759bdd059 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:30 +0200 Subject: docs: networking: convert decnet.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark lists as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/networking/decnet.rst | 243 ++++++++++++++++++++++++ Documentation/networking/decnet.txt | 230 ---------------------- Documentation/networking/index.rst | 1 + MAINTAINERS | 2 +- net/decnet/Kconfig | 4 +- 6 files changed, 248 insertions(+), 234 deletions(-) create mode 100644 Documentation/networking/decnet.rst delete mode 100644 Documentation/networking/decnet.txt (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index f2a93c8679e8..b23ab11587a6 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -831,7 +831,7 @@ decnet.addr= [HW,NET] Format: [,] - See also Documentation/networking/decnet.txt. + See also Documentation/networking/decnet.rst. default_hugepagesz= [same as hugepagesz=] The size of the default diff --git a/Documentation/networking/decnet.rst b/Documentation/networking/decnet.rst new file mode 100644 index 000000000000..b8bc11ff8370 --- /dev/null +++ b/Documentation/networking/decnet.rst @@ -0,0 +1,243 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +Linux DECnet Networking Layer Information +========================================= + +1. Other documentation.... +========================== + + - Project Home Pages + - http://www.chygwyn.com/ - Kernel info + - http://linux-decnet.sourceforge.net/ - Userland tools + - http://www.sourceforge.net/projects/linux-decnet/ - Status page + +2. Configuring the kernel +========================= + +Be sure to turn on the following options: + + - CONFIG_DECNET (obviously) + - CONFIG_PROC_FS (to see what's going on) + - CONFIG_SYSCTL (for easy configuration) + +if you want to try out router support (not properly debugged yet) +you'll need the following options as well... + + - CONFIG_DECNET_ROUTER (to be able to add/delete routes) + - CONFIG_NETFILTER (will be required for the DECnet routing daemon) + +Don't turn on SIOCGIFCONF support for DECnet unless you are really sure +that you need it, in general you won't and it can cause ifconfig to +malfunction. + +Run time configuration has changed slightly from the 2.4 system. If you +want to configure an endnode, then the simplified procedure is as follows: + + - Set the MAC address on your ethernet card before starting _any_ other + network protocols. + +As soon as your network card is brought into the UP state, DECnet should +start working. If you need something more complicated or are unsure how +to set the MAC address, see the next section. Also all configurations which +worked with 2.4 will work under 2.5 with no change. + +3. Command line options +======================= + +You can set a DECnet address on the kernel command line for compatibility +with the 2.4 configuration procedure, but in general it's not needed any more. +If you do st a DECnet address on the command line, it has only one purpose +which is that its added to the addresses on the loopback device. + +With 2.4 kernels, DECnet would only recognise addresses as local if they +were added to the loopback device. In 2.5, any local interface address +can be used to loop back to the local machine. Of course this does not +prevent you adding further addresses to the loopback device if you +want to. + +N.B. Since the address list of an interface determines the addresses for +which "hello" messages are sent, if you don't set an address on the loopback +interface then you won't see any entries in /proc/net/neigh for the local +host until such time as you start a connection. This doesn't affect the +operation of the local communications in any other way though. + +The kernel command line takes options looking like the following:: + + decnet.addr=1,2 + +the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels +and early 2.3.xx kernels, you must use a comma when specifying the +DECnet address like this. For more recent 2.3.xx kernels, you may +use almost any character except space, although a `.` would be the most +obvious choice :-) + +There used to be a third number specifying the node type. This option +has gone away in favour of a per interface node type. This is now set +using /proc/sys/net/decnet/conf//forwarding. This file can be +set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router. + +There are also equivalent options for modules. The node address can +also be set through the /proc/sys/net/decnet/ files, as can other system +parameters. + +Currently the only supported devices are ethernet and ip_gre. The +ethernet address of your ethernet card has to be set according to the DECnet +address of the node in order for it to be autoconfigured (and then appear in +/proc/net/decnet_dev). There is a utility available at the above +FTP sites called dn2ethaddr which can compute the correct ethernet +address to use. The address can be set by ifconfig either before or +at the time the device is brought up. If you are using RedHat you can +add the line:: + + MACADDR=AA:00:04:00:03:04 + +or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or +wherever your network card's configuration lives. Setting the MAC address +of your ethernet card to an address starting with "hi-ord" will cause a +DECnet address which matches to be added to the interface (which you can +verify with iproute2). + +The default device for routing can be set through the /proc filesystem +by setting /proc/sys/net/decnet/default_device to the +device you want DECnet to route packets out of when no specific route +is available. Usually this will be eth0, for example:: + + echo -n "eth0" >/proc/sys/net/decnet/default_device + +If you don't set the default device, then it will default to the first +ethernet card which has been autoconfigured as described above. You can +confirm that by looking in the default_device file of course. + +There is a list of what the other files under /proc/sys/net/decnet/ do +on the kernel patch web site (shown above). + +4. Run time kernel configuration +================================ + + +This is either done through the sysctl/proc interface (see the kernel web +pages for details on what the various options do) or through the iproute2 +package in the same way as IPv4/6 configuration is performed. + +Documentation for iproute2 is included with the package, although there is +as yet no specific section on DECnet, most of the features apply to both +IP and DECnet, albeit with DECnet addresses instead of IP addresses and +a reduced functionality. + +If you want to configure a DECnet router you'll need the iproute2 package +since its the _only_ way to add and delete routes currently. Eventually +there will be a routing daemon to send and receive routing messages for +each interface and update the kernel routing tables accordingly. The +routing daemon will use netfilter to listen to routing packets, and +rtnetlink to update the kernels routing tables. + +The DECnet raw socket layer has been removed since it was there purely +for use by the routing daemon which will now use netfilter (a much cleaner +and more generic solution) instead. + +5. How can I tell if its working? +================================= + +Here is a quick guide of what to look for in order to know if your DECnet +kernel subsystem is working. + + - Is the node address set (see /proc/sys/net/decnet/node_address) + - Is the node of the correct type + (see /proc/sys/net/decnet/conf//forwarding) + - Is the Ethernet MAC address of each Ethernet card set to match + the DECnet address. If in doubt use the dn2ethaddr utility available + at the ftp archive. + - If the previous two steps are satisfied, and the Ethernet card is up, + you should find that it is listed in /proc/net/decnet_dev and also + that it appears as a directory in /proc/sys/net/decnet/conf/. The + loopback device (lo) should also appear and is required to communicate + within a node. + - If you have any DECnet routers on your network, they should appear + in /proc/net/decnet_neigh, otherwise this file will only contain the + entry for the node itself (if it doesn't check to see if lo is up). + - If you want to send to any node which is not listed in the + /proc/net/decnet_neigh file, you'll need to set the default device + to point to an Ethernet card with connection to a router. This is + again done with the /proc/sys/net/decnet/default_device file. + - Try starting a simple server and client, like the dnping/dnmirror + over the loopback interface. With luck they should communicate. + For this step and those after, you'll need the DECnet library + which can be obtained from the above ftp sites as well as the + actual utilities themselves. + - If this seems to work, then try talking to a node on your local + network, and see if you can obtain the same results. + - At this point you are on your own... :-) + +6. How to send a bug report +=========================== + +If you've found a bug and want to report it, then there are several things +you can do to help me work out exactly what it is that is wrong. Useful +information (_most_ of which _is_ _essential_) includes: + + - What kernel version are you running ? + - What version of the patch are you running ? + - How far though the above set of tests can you get ? + - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ? + - Which services are you running ? + - Which client caused the problem ? + - How much data was being transferred ? + - Was the network congested ? + - How can the problem be reproduced ? + - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of + tcpdump don't understand how to dump DECnet properly, so including + the hex listing of the packet contents is _essential_, usually the -x flag. + You may also need to increase the length grabbed with the -s flag. The + -e flag also provides very useful information (ethernet MAC addresses)) + +7. MAC FAQ +========== + +A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet +interact and how to get the best performance from your hardware. + +Ethernet cards are designed to normally only pass received network frames +to a host computer when they are addressed to it, or to the broadcast address. + +Linux has an interface which allows the setting of extra addresses for +an ethernet card to listen to. If the ethernet card supports it, the +filtering operation will be done in hardware, if not the extra unwanted packets +received will be discarded by the host computer. In the latter case, +significant processor time and bus bandwidth can be used up on a busy +network (see the NAPI documentation for a longer explanation of these +effects). + +DECnet makes use of this interface to allow running DECnet on an ethernet +card which has already been configured using TCP/IP (presumably using the +built in MAC address of the card, as usual) and/or to allow multiple DECnet +addresses on each physical interface. If you do this, be aware that if your +ethernet card doesn't support perfect hashing in its MAC address filter +then your computer will be doing more work than required. Some cards +will simply set themselves into promiscuous mode in order to receive +packets from the DECnet specified addresses. So if you have one of these +cards its better to set the MAC address of the card as described above +to gain the best efficiency. Better still is to use a card which supports +NAPI as well. + + +8. Mailing list +=============== + +If you are keen to get involved in development, or want to ask questions +about configuration, or even just report bugs, then there is a mailing +list that you can join, details are at: + +http://sourceforge.net/mail/?group_id=4993 + +9. Legal Info +============= + +The Linux DECnet project team have placed their code under the GPL. The +software is provided "as is" and without warranty express or implied. +DECnet is a trademark of Compaq. This software is not a product of +Compaq. We acknowledge the help of people at Compaq in providing extra +documentation above and beyond what was previously publicly available. + +Steve Whitehouse + diff --git a/Documentation/networking/decnet.txt b/Documentation/networking/decnet.txt deleted file mode 100644 index d192f8b9948b..000000000000 --- a/Documentation/networking/decnet.txt +++ /dev/null @@ -1,230 +0,0 @@ - Linux DECnet Networking Layer Information - =========================================== - -1) Other documentation.... - - o Project Home Pages - http://www.chygwyn.com/ - Kernel info - http://linux-decnet.sourceforge.net/ - Userland tools - http://www.sourceforge.net/projects/linux-decnet/ - Status page - -2) Configuring the kernel - -Be sure to turn on the following options: - - CONFIG_DECNET (obviously) - CONFIG_PROC_FS (to see what's going on) - CONFIG_SYSCTL (for easy configuration) - -if you want to try out router support (not properly debugged yet) -you'll need the following options as well... - - CONFIG_DECNET_ROUTER (to be able to add/delete routes) - CONFIG_NETFILTER (will be required for the DECnet routing daemon) - -Don't turn on SIOCGIFCONF support for DECnet unless you are really sure -that you need it, in general you won't and it can cause ifconfig to -malfunction. - -Run time configuration has changed slightly from the 2.4 system. If you -want to configure an endnode, then the simplified procedure is as follows: - - o Set the MAC address on your ethernet card before starting _any_ other - network protocols. - -As soon as your network card is brought into the UP state, DECnet should -start working. If you need something more complicated or are unsure how -to set the MAC address, see the next section. Also all configurations which -worked with 2.4 will work under 2.5 with no change. - -3) Command line options - -You can set a DECnet address on the kernel command line for compatibility -with the 2.4 configuration procedure, but in general it's not needed any more. -If you do st a DECnet address on the command line, it has only one purpose -which is that its added to the addresses on the loopback device. - -With 2.4 kernels, DECnet would only recognise addresses as local if they -were added to the loopback device. In 2.5, any local interface address -can be used to loop back to the local machine. Of course this does not -prevent you adding further addresses to the loopback device if you -want to. - -N.B. Since the address list of an interface determines the addresses for -which "hello" messages are sent, if you don't set an address on the loopback -interface then you won't see any entries in /proc/net/neigh for the local -host until such time as you start a connection. This doesn't affect the -operation of the local communications in any other way though. - -The kernel command line takes options looking like the following: - - decnet.addr=1,2 - -the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels -and early 2.3.xx kernels, you must use a comma when specifying the -DECnet address like this. For more recent 2.3.xx kernels, you may -use almost any character except space, although a `.` would be the most -obvious choice :-) - -There used to be a third number specifying the node type. This option -has gone away in favour of a per interface node type. This is now set -using /proc/sys/net/decnet/conf//forwarding. This file can be -set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router. - -There are also equivalent options for modules. The node address can -also be set through the /proc/sys/net/decnet/ files, as can other system -parameters. - -Currently the only supported devices are ethernet and ip_gre. The -ethernet address of your ethernet card has to be set according to the DECnet -address of the node in order for it to be autoconfigured (and then appear in -/proc/net/decnet_dev). There is a utility available at the above -FTP sites called dn2ethaddr which can compute the correct ethernet -address to use. The address can be set by ifconfig either before or -at the time the device is brought up. If you are using RedHat you can -add the line: - - MACADDR=AA:00:04:00:03:04 - -or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or -wherever your network card's configuration lives. Setting the MAC address -of your ethernet card to an address starting with "hi-ord" will cause a -DECnet address which matches to be added to the interface (which you can -verify with iproute2). - -The default device for routing can be set through the /proc filesystem -by setting /proc/sys/net/decnet/default_device to the -device you want DECnet to route packets out of when no specific route -is available. Usually this will be eth0, for example: - - echo -n "eth0" >/proc/sys/net/decnet/default_device - -If you don't set the default device, then it will default to the first -ethernet card which has been autoconfigured as described above. You can -confirm that by looking in the default_device file of course. - -There is a list of what the other files under /proc/sys/net/decnet/ do -on the kernel patch web site (shown above). - -4) Run time kernel configuration - -This is either done through the sysctl/proc interface (see the kernel web -pages for details on what the various options do) or through the iproute2 -package in the same way as IPv4/6 configuration is performed. - -Documentation for iproute2 is included with the package, although there is -as yet no specific section on DECnet, most of the features apply to both -IP and DECnet, albeit with DECnet addresses instead of IP addresses and -a reduced functionality. - -If you want to configure a DECnet router you'll need the iproute2 package -since its the _only_ way to add and delete routes currently. Eventually -there will be a routing daemon to send and receive routing messages for -each interface and update the kernel routing tables accordingly. The -routing daemon will use netfilter to listen to routing packets, and -rtnetlink to update the kernels routing tables. - -The DECnet raw socket layer has been removed since it was there purely -for use by the routing daemon which will now use netfilter (a much cleaner -and more generic solution) instead. - -5) How can I tell if its working ? - -Here is a quick guide of what to look for in order to know if your DECnet -kernel subsystem is working. - - - Is the node address set (see /proc/sys/net/decnet/node_address) - - Is the node of the correct type - (see /proc/sys/net/decnet/conf//forwarding) - - Is the Ethernet MAC address of each Ethernet card set to match - the DECnet address. If in doubt use the dn2ethaddr utility available - at the ftp archive. - - If the previous two steps are satisfied, and the Ethernet card is up, - you should find that it is listed in /proc/net/decnet_dev and also - that it appears as a directory in /proc/sys/net/decnet/conf/. The - loopback device (lo) should also appear and is required to communicate - within a node. - - If you have any DECnet routers on your network, they should appear - in /proc/net/decnet_neigh, otherwise this file will only contain the - entry for the node itself (if it doesn't check to see if lo is up). - - If you want to send to any node which is not listed in the - /proc/net/decnet_neigh file, you'll need to set the default device - to point to an Ethernet card with connection to a router. This is - again done with the /proc/sys/net/decnet/default_device file. - - Try starting a simple server and client, like the dnping/dnmirror - over the loopback interface. With luck they should communicate. - For this step and those after, you'll need the DECnet library - which can be obtained from the above ftp sites as well as the - actual utilities themselves. - - If this seems to work, then try talking to a node on your local - network, and see if you can obtain the same results. - - At this point you are on your own... :-) - -6) How to send a bug report - -If you've found a bug and want to report it, then there are several things -you can do to help me work out exactly what it is that is wrong. Useful -information (_most_ of which _is_ _essential_) includes: - - - What kernel version are you running ? - - What version of the patch are you running ? - - How far though the above set of tests can you get ? - - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ? - - Which services are you running ? - - Which client caused the problem ? - - How much data was being transferred ? - - Was the network congested ? - - How can the problem be reproduced ? - - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of - tcpdump don't understand how to dump DECnet properly, so including - the hex listing of the packet contents is _essential_, usually the -x flag. - You may also need to increase the length grabbed with the -s flag. The - -e flag also provides very useful information (ethernet MAC addresses)) - -7) MAC FAQ - -A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet -interact and how to get the best performance from your hardware. - -Ethernet cards are designed to normally only pass received network frames -to a host computer when they are addressed to it, or to the broadcast address. - -Linux has an interface which allows the setting of extra addresses for -an ethernet card to listen to. If the ethernet card supports it, the -filtering operation will be done in hardware, if not the extra unwanted packets -received will be discarded by the host computer. In the latter case, -significant processor time and bus bandwidth can be used up on a busy -network (see the NAPI documentation for a longer explanation of these -effects). - -DECnet makes use of this interface to allow running DECnet on an ethernet -card which has already been configured using TCP/IP (presumably using the -built in MAC address of the card, as usual) and/or to allow multiple DECnet -addresses on each physical interface. If you do this, be aware that if your -ethernet card doesn't support perfect hashing in its MAC address filter -then your computer will be doing more work than required. Some cards -will simply set themselves into promiscuous mode in order to receive -packets from the DECnet specified addresses. So if you have one of these -cards its better to set the MAC address of the card as described above -to gain the best efficiency. Better still is to use a card which supports -NAPI as well. - - -8) Mailing list - -If you are keen to get involved in development, or want to ask questions -about configuration, or even just report bugs, then there is a mailing -list that you can join, details are at: - -http://sourceforge.net/mail/?group_id=4993 - -9) Legal Info - -The Linux DECnet project team have placed their code under the GPL. The -software is provided "as is" and without warranty express or implied. -DECnet is a trademark of Compaq. This software is not a product of -Compaq. We acknowledge the help of people at Compaq in providing extra -documentation above and beyond what was previously publicly available. - -Steve Whitehouse - diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 9e83d3bda4e0..e17432492745 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -50,6 +50,7 @@ Contents: cxacru dccp dctcp + decnet .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index 453fe0713e68..7323bfc1720f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4728,7 +4728,7 @@ DECnet NETWORK LAYER L: linux-decnet-user@lists.sourceforge.net S: Orphan W: http://linux-decnet.sourceforge.net -F: Documentation/networking/decnet.txt +F: Documentation/networking/decnet.rst F: net/decnet/ DECSTATION PLATFORM SUPPORT diff --git a/net/decnet/Kconfig b/net/decnet/Kconfig index 0935453ccfd5..8f98fb2f2ec9 100644 --- a/net/decnet/Kconfig +++ b/net/decnet/Kconfig @@ -15,7 +15,7 @@ config DECNET . More detailed documentation is available in - . + . Be sure to say Y to "/proc file system support" and "Sysctl support" below when using DECnet, since you will need sysctl support to aid @@ -40,4 +40,4 @@ config DECNET_ROUTER filtering" option will be required for the forthcoming routing daemon to work. - See for more information. + See for more information. -- cgit v1.2.3 From 5f32c920c23b75654a839aa87c344b2bcaf350e2 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:31 +0200 Subject: docs: networking: convert defza.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - use :field: markup for the version number; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/defza.rst | 63 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/defza.txt | 57 ---------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 64 insertions(+), 57 deletions(-) create mode 100644 Documentation/networking/defza.rst delete mode 100644 Documentation/networking/defza.txt (limited to 'Documentation') diff --git a/Documentation/networking/defza.rst b/Documentation/networking/defza.rst new file mode 100644 index 000000000000..73c2f793ea26 --- /dev/null +++ b/Documentation/networking/defza.rst @@ -0,0 +1,63 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================================== +Notes on the DEC FDDIcontroller 700 (DEFZA-xx) driver +===================================================== + +:Version: v.1.1.4 + + +DEC FDDIcontroller 700 is DEC's first-generation TURBOchannel FDDI +network card, designed in 1990 specifically for the DECstation 5000 +model 200 workstation. The board is a single attachment station and +it was manufactured in two variations, both of which are supported. + +First is the SAS MMF DEFZA-AA option, the original design implementing +the standard MMF-PMD, however with a pair of ST connectors rather than +the usual MIC connector. The other one is the SAS ThinWire/STP DEFZA-CA +option, denoted 700-C, with the network medium selectable by a switch +between the DEC proprietary ThinWire-PMD using a BNC connector and the +standard STP-PMD using a DE-9F connector. This option can interface to +a DECconcentrator 500 device and, in the case of the STP-PMD, also other +FDDI equipment and was designed to make it easier to transition from +existing IEEE 802.3 10BASE2 Ethernet and IEEE 802.5 Token Ring networks +by providing means to reuse existing cabling. + +This driver handles any number of cards installed in a single system. +They get fddi0, fddi1, etc. interface names assigned in the order of +increasing TURBOchannel slot numbers. + +The board only supports DMA on the receive side. Transmission involves +the use of PIO. As a result under a heavy transmission load there will +be a significant impact on system performance. + +The board supports a 64-entry CAM for matching destination addresses. +Two entries are preoccupied by the Directed Beacon and Ring Purger +multicast addresses and the rest is used as a multicast filter. An +all-multi mode is also supported for LLC frames and it is used if +requested explicitly or if the CAM overflows. The promiscuous mode +supports separate enables for LLC and SMT frames, but this driver +doesn't support changing them individually. + + +Known problems: + +None. + + +To do: + +5. MAC address change. The card does not support changing the Media + Access Controller's address registers but a similar effect can be + achieved by adding an alias to the CAM. There is no way to disable + matching against the original address though. + +7. Queueing incoming/outgoing SMT frames in the driver if the SMT + receive/RMC transmit ring is full. (?) + +8. Retrieving/reporting FDDI/SNMP stats. + + +Both success and failure reports are welcome. + +Maciej W. Rozycki diff --git a/Documentation/networking/defza.txt b/Documentation/networking/defza.txt deleted file mode 100644 index 663e4a906751..000000000000 --- a/Documentation/networking/defza.txt +++ /dev/null @@ -1,57 +0,0 @@ -Notes on the DEC FDDIcontroller 700 (DEFZA-xx) driver v.1.1.4. - - -DEC FDDIcontroller 700 is DEC's first-generation TURBOchannel FDDI -network card, designed in 1990 specifically for the DECstation 5000 -model 200 workstation. The board is a single attachment station and -it was manufactured in two variations, both of which are supported. - -First is the SAS MMF DEFZA-AA option, the original design implementing -the standard MMF-PMD, however with a pair of ST connectors rather than -the usual MIC connector. The other one is the SAS ThinWire/STP DEFZA-CA -option, denoted 700-C, with the network medium selectable by a switch -between the DEC proprietary ThinWire-PMD using a BNC connector and the -standard STP-PMD using a DE-9F connector. This option can interface to -a DECconcentrator 500 device and, in the case of the STP-PMD, also other -FDDI equipment and was designed to make it easier to transition from -existing IEEE 802.3 10BASE2 Ethernet and IEEE 802.5 Token Ring networks -by providing means to reuse existing cabling. - -This driver handles any number of cards installed in a single system. -They get fddi0, fddi1, etc. interface names assigned in the order of -increasing TURBOchannel slot numbers. - -The board only supports DMA on the receive side. Transmission involves -the use of PIO. As a result under a heavy transmission load there will -be a significant impact on system performance. - -The board supports a 64-entry CAM for matching destination addresses. -Two entries are preoccupied by the Directed Beacon and Ring Purger -multicast addresses and the rest is used as a multicast filter. An -all-multi mode is also supported for LLC frames and it is used if -requested explicitly or if the CAM overflows. The promiscuous mode -supports separate enables for LLC and SMT frames, but this driver -doesn't support changing them individually. - - -Known problems: - -None. - - -To do: - -5. MAC address change. The card does not support changing the Media - Access Controller's address registers but a similar effect can be - achieved by adding an alias to the CAM. There is no way to disable - matching against the original address though. - -7. Queueing incoming/outgoing SMT frames in the driver if the SMT - receive/RMC transmit ring is full. (?) - -8. Retrieving/reporting FDDI/SNMP stats. - - -Both success and failure reports are welcome. - -Maciej W. Rozycki diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e17432492745..c893127004b9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -51,6 +51,7 @@ Contents: dccp dctcp decnet + defza .. only:: subproject and html -- cgit v1.2.3 From 9dfe1361261be48c92fd7cb26909cbcd5d496220 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:32 +0200 Subject: docs: networking: convert dns_resolver.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/dns_resolver.rst | 155 +++++++++++++++++++++++++++++ Documentation/networking/dns_resolver.txt | 157 ------------------------------ Documentation/networking/index.rst | 1 + net/ceph/Kconfig | 2 +- net/dns_resolver/Kconfig | 2 +- net/dns_resolver/dns_key.c | 2 +- net/dns_resolver/dns_query.c | 2 +- 7 files changed, 160 insertions(+), 161 deletions(-) create mode 100644 Documentation/networking/dns_resolver.rst delete mode 100644 Documentation/networking/dns_resolver.txt (limited to 'Documentation') diff --git a/Documentation/networking/dns_resolver.rst b/Documentation/networking/dns_resolver.rst new file mode 100644 index 000000000000..add4d59a99a5 --- /dev/null +++ b/Documentation/networking/dns_resolver.rst @@ -0,0 +1,155 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +DNS Resolver Module +=================== + +.. Contents: + + - Overview. + - Compilation. + - Setting up. + - Usage. + - Mechanism. + - Debugging. + + +Overview +======== + +The DNS resolver module provides a way for kernel services to make DNS queries +by way of requesting a key of key type dns_resolver. These queries are +upcalled to userspace through /sbin/request-key. + +These routines must be supported by userspace tools dns.upcall, cifs.upcall and +request-key. It is under development and does not yet provide the full feature +set. The features it does support include: + + (*) Implements the dns_resolver key_type to contact userspace. + +It does not yet support the following AFS features: + + (*) Dns query support for AFSDB resource record. + +This code is extracted from the CIFS filesystem. + + +Compilation +=========== + +The module should be enabled by turning on the kernel configuration options:: + + CONFIG_DNS_RESOLVER - tristate "DNS Resolver support" + + +Setting up +========== + +To set up this facility, the /etc/request-key.conf file must be altered so that +/sbin/request-key can appropriately direct the upcalls. For example, to handle +basic dname to IPv4/IPv6 address resolution, the following line should be +added:: + + + #OP TYPE DESC CO-INFO PROGRAM ARG1 ARG2 ARG3 ... + #====== ============ ======= ======= ========================== + create dns_resolver * * /usr/sbin/cifs.upcall %k + +To direct a query for query type 'foo', a line of the following should be added +before the more general line given above as the first match is the one taken:: + + create dns_resolver foo:* * /usr/sbin/dns.foo %k + + +Usage +===== + +To make use of this facility, one of the following functions that are +implemented in the module can be called after doing:: + + #include + + :: + + int dns_query(const char *type, const char *name, size_t namelen, + const char *options, char **_result, time_t *_expiry); + + This is the basic access function. It looks for a cached DNS query and if + it doesn't find it, it upcalls to userspace to make a new DNS query, which + may then be cached. The key description is constructed as a string of the + form:: + + [:] + + where optionally specifies the particular upcall program to invoke, + and thus the type of query to do, and specifies the string to be + looked up. The default query type is a straight hostname to IP address + set lookup. + + The name parameter is not required to be a NUL-terminated string, and its + length should be given by the namelen argument. + + The options parameter may be NULL or it may be a set of options + appropriate to the query type. + + The return value is a string appropriate to the query type. For instance, + for the default query type it is just a list of comma-separated IPv4 and + IPv6 addresses. The caller must free the result. + + The length of the result string is returned on success, and a negative + error code is returned otherwise. -EKEYREJECTED will be returned if the + DNS lookup failed. + + If _expiry is non-NULL, the expiry time (TTL) of the result will be + returned also. + +The kernel maintains an internal keyring in which it caches looked up keys. +This can be cleared by any process that has the CAP_SYS_ADMIN capability by +the use of KEYCTL_KEYRING_CLEAR on the keyring ID. + + +Reading DNS Keys from Userspace +=============================== + +Keys of dns_resolver type can be read from userspace using keyctl_read() or +"keyctl read/print/pipe". + + +Mechanism +========= + +The dnsresolver module registers a key type called "dns_resolver". Keys of +this type are used to transport and cache DNS lookup results from userspace. + +When dns_query() is invoked, it calls request_key() to search the local +keyrings for a cached DNS result. If that fails to find one, it upcalls to +userspace to get a new result. + +Upcalls to userspace are made through the request_key() upcall vector, and are +directed by means of configuration lines in /etc/request-key.conf that tell +/sbin/request-key what program to run to instantiate the key. + +The upcall handler program is responsible for querying the DNS, processing the +result into a form suitable for passing to the keyctl_instantiate_key() +routine. This then passes the data to dns_resolver_instantiate() which strips +off and processes any options included in the data, and then attaches the +remainder of the string to the key as its payload. + +The upcall handler program should set the expiry time on the key to that of the +lowest TTL of all the records it has extracted a result from. This means that +the key will be discarded and recreated when the data it holds has expired. + +dns_query() returns a copy of the value attached to the key, or an error if +that is indicated instead. + +See for further +information about request-key function. + + +Debugging +========= + +Debugging messages can be turned on dynamically by writing a 1 into the +following file:: + + /sys/module/dnsresolver/parameters/debug diff --git a/Documentation/networking/dns_resolver.txt b/Documentation/networking/dns_resolver.txt deleted file mode 100644 index eaa8f9a6fd5d..000000000000 --- a/Documentation/networking/dns_resolver.txt +++ /dev/null @@ -1,157 +0,0 @@ - =================== - DNS Resolver Module - =================== - -Contents: - - - Overview. - - Compilation. - - Setting up. - - Usage. - - Mechanism. - - Debugging. - - -======== -OVERVIEW -======== - -The DNS resolver module provides a way for kernel services to make DNS queries -by way of requesting a key of key type dns_resolver. These queries are -upcalled to userspace through /sbin/request-key. - -These routines must be supported by userspace tools dns.upcall, cifs.upcall and -request-key. It is under development and does not yet provide the full feature -set. The features it does support include: - - (*) Implements the dns_resolver key_type to contact userspace. - -It does not yet support the following AFS features: - - (*) Dns query support for AFSDB resource record. - -This code is extracted from the CIFS filesystem. - - -=========== -COMPILATION -=========== - -The module should be enabled by turning on the kernel configuration options: - - CONFIG_DNS_RESOLVER - tristate "DNS Resolver support" - - -========== -SETTING UP -========== - -To set up this facility, the /etc/request-key.conf file must be altered so that -/sbin/request-key can appropriately direct the upcalls. For example, to handle -basic dname to IPv4/IPv6 address resolution, the following line should be -added: - - #OP TYPE DESC CO-INFO PROGRAM ARG1 ARG2 ARG3 ... - #====== ============ ======= ======= ========================== - create dns_resolver * * /usr/sbin/cifs.upcall %k - -To direct a query for query type 'foo', a line of the following should be added -before the more general line given above as the first match is the one taken. - - create dns_resolver foo:* * /usr/sbin/dns.foo %k - - -===== -USAGE -===== - -To make use of this facility, one of the following functions that are -implemented in the module can be called after doing: - - #include - - (1) int dns_query(const char *type, const char *name, size_t namelen, - const char *options, char **_result, time_t *_expiry); - - This is the basic access function. It looks for a cached DNS query and if - it doesn't find it, it upcalls to userspace to make a new DNS query, which - may then be cached. The key description is constructed as a string of the - form: - - [:] - - where optionally specifies the particular upcall program to invoke, - and thus the type of query to do, and specifies the string to be - looked up. The default query type is a straight hostname to IP address - set lookup. - - The name parameter is not required to be a NUL-terminated string, and its - length should be given by the namelen argument. - - The options parameter may be NULL or it may be a set of options - appropriate to the query type. - - The return value is a string appropriate to the query type. For instance, - for the default query type it is just a list of comma-separated IPv4 and - IPv6 addresses. The caller must free the result. - - The length of the result string is returned on success, and a negative - error code is returned otherwise. -EKEYREJECTED will be returned if the - DNS lookup failed. - - If _expiry is non-NULL, the expiry time (TTL) of the result will be - returned also. - -The kernel maintains an internal keyring in which it caches looked up keys. -This can be cleared by any process that has the CAP_SYS_ADMIN capability by -the use of KEYCTL_KEYRING_CLEAR on the keyring ID. - - -=============================== -READING DNS KEYS FROM USERSPACE -=============================== - -Keys of dns_resolver type can be read from userspace using keyctl_read() or -"keyctl read/print/pipe". - - -========= -MECHANISM -========= - -The dnsresolver module registers a key type called "dns_resolver". Keys of -this type are used to transport and cache DNS lookup results from userspace. - -When dns_query() is invoked, it calls request_key() to search the local -keyrings for a cached DNS result. If that fails to find one, it upcalls to -userspace to get a new result. - -Upcalls to userspace are made through the request_key() upcall vector, and are -directed by means of configuration lines in /etc/request-key.conf that tell -/sbin/request-key what program to run to instantiate the key. - -The upcall handler program is responsible for querying the DNS, processing the -result into a form suitable for passing to the keyctl_instantiate_key() -routine. This then passes the data to dns_resolver_instantiate() which strips -off and processes any options included in the data, and then attaches the -remainder of the string to the key as its payload. - -The upcall handler program should set the expiry time on the key to that of the -lowest TTL of all the records it has extracted a result from. This means that -the key will be discarded and recreated when the data it holds has expired. - -dns_query() returns a copy of the value attached to the key, or an error if -that is indicated instead. - -See for further -information about request-key function. - - -========= -DEBUGGING -========= - -Debugging messages can be turned on dynamically by writing a 1 into the -following file: - - /sys/module/dnsresolver/parameters/debug diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index c893127004b9..55746038a7e9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -52,6 +52,7 @@ Contents: dctcp decnet defza + dns_resolver .. only:: subproject and html diff --git a/net/ceph/Kconfig b/net/ceph/Kconfig index 2e8e6f904920..d7bec7adc267 100644 --- a/net/ceph/Kconfig +++ b/net/ceph/Kconfig @@ -39,6 +39,6 @@ config CEPH_LIB_USE_DNS_RESOLVER be resolved using the CONFIG_DNS_RESOLVER facility. For information on how to use CONFIG_DNS_RESOLVER consult - Documentation/networking/dns_resolver.txt + Documentation/networking/dns_resolver.rst If unsure, say N. diff --git a/net/dns_resolver/Kconfig b/net/dns_resolver/Kconfig index 0a1c2238b4bd..255df9b6e9e8 100644 --- a/net/dns_resolver/Kconfig +++ b/net/dns_resolver/Kconfig @@ -19,7 +19,7 @@ config DNS_RESOLVER SMB2 later. DNS Resolver is supported by the userspace upcall helper "/sbin/dns.resolver" via /etc/request-key.conf. - See for further + See for further information. To compile this as a module, choose M here: the module will be called diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c index ad53eb31d40f..3aced951d5ab 100644 --- a/net/dns_resolver/dns_key.c +++ b/net/dns_resolver/dns_key.c @@ -1,6 +1,6 @@ /* Key type used to cache DNS lookups made by the kernel * - * See Documentation/networking/dns_resolver.txt + * See Documentation/networking/dns_resolver.rst * * Copyright (c) 2007 Igor Mammedov * Author(s): Igor Mammedov (niallain@gmail.com) diff --git a/net/dns_resolver/dns_query.c b/net/dns_resolver/dns_query.c index cab4e0df924f..82b084cc1cc6 100644 --- a/net/dns_resolver/dns_query.c +++ b/net/dns_resolver/dns_query.c @@ -1,7 +1,7 @@ /* Upcall routine, designed to work as a key type and working through * /sbin/request-key to contact userspace when handling DNS queries. * - * See Documentation/networking/dns_resolver.txt + * See Documentation/networking/dns_resolver.rst * * Copyright (c) 2007 Igor Mammedov * Author(s): Igor Mammedov (niallain@gmail.com) -- cgit v1.2.3 From 28d23311ff35ac97ff20608f47c84c95d6389c33 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:33 +0200 Subject: docs: networking: convert driver.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/driver.rst | 97 +++++++++++++++++++++++++++++++++++++ Documentation/networking/driver.txt | 93 ----------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 98 insertions(+), 93 deletions(-) create mode 100644 Documentation/networking/driver.rst delete mode 100644 Documentation/networking/driver.txt (limited to 'Documentation') diff --git a/Documentation/networking/driver.rst b/Documentation/networking/driver.rst new file mode 100644 index 000000000000..c8f59dbda46f --- /dev/null +++ b/Documentation/networking/driver.rst @@ -0,0 +1,97 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Softnet Driver Issues +===================== + +Transmit path guidelines: + +1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under + any normal circumstances. It is considered a hard error unless + there is no way your device can tell ahead of time when it's + transmit function will become busy. + + Instead it must maintain the queue properly. For example, + for a driver implementing scatter-gather this means:: + + static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb, + struct net_device *dev) + { + struct drv *dp = netdev_priv(dev); + + lock_tx(dp); + ... + /* This is a hard error log it. */ + if (TX_BUFFS_AVAIL(dp) <= (skb_shinfo(skb)->nr_frags + 1)) { + netif_stop_queue(dev); + unlock_tx(dp); + printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", + dev->name); + return NETDEV_TX_BUSY; + } + + ... queue packet to card ... + ... update tx consumer index ... + + if (TX_BUFFS_AVAIL(dp) <= (MAX_SKB_FRAGS + 1)) + netif_stop_queue(dev); + + ... + unlock_tx(dp); + ... + return NETDEV_TX_OK; + } + + And then at the end of your TX reclamation event handling:: + + if (netif_queue_stopped(dp->dev) && + TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1)) + netif_wake_queue(dp->dev); + + For a non-scatter-gather supporting card, the three tests simply become:: + + /* This is a hard error log it. */ + if (TX_BUFFS_AVAIL(dp) <= 0) + + and:: + + if (TX_BUFFS_AVAIL(dp) == 0) + + and:: + + if (netif_queue_stopped(dp->dev) && + TX_BUFFS_AVAIL(dp) > 0) + netif_wake_queue(dp->dev); + +2) An ndo_start_xmit method must not modify the shared parts of a + cloned SKB. + +3) Do not forget that once you return NETDEV_TX_OK from your + ndo_start_xmit method, it is your driver's responsibility to free + up the SKB and in some finite amount of time. + + For example, this means that it is not allowed for your TX + mitigation scheme to let TX packets "hang out" in the TX + ring unreclaimed forever if no new TX packets are sent. + This error can deadlock sockets waiting for send buffer room + to be freed up. + + If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you + must not keep any reference to that SKB and you must not attempt + to free it up. + +Probing guidelines: + +1) Any hardware layer address you obtain for your device should + be verified. For example, for ethernet check it with + linux/etherdevice.h:is_valid_ether_addr() + +Close/stop guidelines: + +1) After the ndo_stop routine has been called, the hardware must + not receive or transmit any data. All in flight packets must + be aborted. If necessary, poll or wait for completion of + any reset commands. + +2) The ndo_stop routine will be called by unregister_netdevice + if device is still UP. diff --git a/Documentation/networking/driver.txt b/Documentation/networking/driver.txt deleted file mode 100644 index da59e2884130..000000000000 --- a/Documentation/networking/driver.txt +++ /dev/null @@ -1,93 +0,0 @@ -Document about softnet driver issues - -Transmit path guidelines: - -1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under - any normal circumstances. It is considered a hard error unless - there is no way your device can tell ahead of time when it's - transmit function will become busy. - - Instead it must maintain the queue properly. For example, - for a driver implementing scatter-gather this means: - - static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb, - struct net_device *dev) - { - struct drv *dp = netdev_priv(dev); - - lock_tx(dp); - ... - /* This is a hard error log it. */ - if (TX_BUFFS_AVAIL(dp) <= (skb_shinfo(skb)->nr_frags + 1)) { - netif_stop_queue(dev); - unlock_tx(dp); - printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", - dev->name); - return NETDEV_TX_BUSY; - } - - ... queue packet to card ... - ... update tx consumer index ... - - if (TX_BUFFS_AVAIL(dp) <= (MAX_SKB_FRAGS + 1)) - netif_stop_queue(dev); - - ... - unlock_tx(dp); - ... - return NETDEV_TX_OK; - } - - And then at the end of your TX reclamation event handling: - - if (netif_queue_stopped(dp->dev) && - TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1)) - netif_wake_queue(dp->dev); - - For a non-scatter-gather supporting card, the three tests simply become: - - /* This is a hard error log it. */ - if (TX_BUFFS_AVAIL(dp) <= 0) - - and: - - if (TX_BUFFS_AVAIL(dp) == 0) - - and: - - if (netif_queue_stopped(dp->dev) && - TX_BUFFS_AVAIL(dp) > 0) - netif_wake_queue(dp->dev); - -2) An ndo_start_xmit method must not modify the shared parts of a - cloned SKB. - -3) Do not forget that once you return NETDEV_TX_OK from your - ndo_start_xmit method, it is your driver's responsibility to free - up the SKB and in some finite amount of time. - - For example, this means that it is not allowed for your TX - mitigation scheme to let TX packets "hang out" in the TX - ring unreclaimed forever if no new TX packets are sent. - This error can deadlock sockets waiting for send buffer room - to be freed up. - - If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you - must not keep any reference to that SKB and you must not attempt - to free it up. - -Probing guidelines: - -1) Any hardware layer address you obtain for your device should - be verified. For example, for ethernet check it with - linux/etherdevice.h:is_valid_ether_addr() - -Close/stop guidelines: - -1) After the ndo_stop routine has been called, the hardware must - not receive or transmit any data. All in flight packets must - be aborted. If necessary, poll or wait for completion of - any reset commands. - -2) The ndo_stop routine will be called by unregister_netdevice - if device is still UP. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 55746038a7e9..313f66900bce 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -53,6 +53,7 @@ Contents: decnet defza dns_resolver + driver .. only:: subproject and html -- cgit v1.2.3 From 06df65723b69a3d4691737503654400c35f9ca5a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:34 +0200 Subject: docs: networking: convert eql.txt to ReST - add SPDX header; - add a document title; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/eql.rst | 373 ++++++++++++++++++++++++++ Documentation/networking/eql.txt | 528 ------------------------------------- Documentation/networking/index.rst | 1 + drivers/net/Kconfig | 2 +- 4 files changed, 375 insertions(+), 529 deletions(-) create mode 100644 Documentation/networking/eql.rst delete mode 100644 Documentation/networking/eql.txt (limited to 'Documentation') diff --git a/Documentation/networking/eql.rst b/Documentation/networking/eql.rst new file mode 100644 index 000000000000..a628c4c81166 --- /dev/null +++ b/Documentation/networking/eql.rst @@ -0,0 +1,373 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================== +EQL Driver: Serial IP Load Balancing HOWTO +========================================== + + Simon "Guru Aleph-Null" Janes, simon@ncm.com + + v1.1, February 27, 1995 + + This is the manual for the EQL device driver. EQL is a software device + that lets you load-balance IP serial links (SLIP or uncompressed PPP) + to increase your bandwidth. It will not reduce your latency (i.e. ping + times) except in the case where you already have lots of traffic on + your link, in which it will help them out. This driver has been tested + with the 1.1.75 kernel, and is known to have patched cleanly with + 1.1.86. Some testing with 1.1.92 has been done with the v1.1 patch + which was only created to patch cleanly in the very latest kernel + source trees. (Yes, it worked fine.) + +1. Introduction +=============== + + Which is worse? A huge fee for a 56K leased line or two phone lines? + It's probably the former. If you find yourself craving more bandwidth, + and have a ISP that is flexible, it is now possible to bind modems + together to work as one point-to-point link to increase your + bandwidth. All without having to have a special black box on either + side. + + + The eql driver has only been tested with the Livingston PortMaster-2e + terminal server. I do not know if other terminal servers support load- + balancing, but I do know that the PortMaster does it, and does it + almost as well as the eql driver seems to do it (-- Unfortunately, in + my testing so far, the Livingston PortMaster 2e's load-balancing is a + good 1 to 2 KB/s slower than the test machine working with a 28.8 Kbps + and 14.4 Kbps connection. However, I am not sure that it really is + the PortMaster, or if it's Linux's TCP drivers. I'm told that Linux's + TCP implementation is pretty fast though.--) + + + I suggest to ISPs out there that it would probably be fair to charge + a load-balancing client 75% of the cost of the second line and 50% of + the cost of the third line etc... + + + Hey, we can all dream you know... + + +2. Kernel Configuration +======================= + + Here I describe the general steps of getting a kernel up and working + with the eql driver. From patching, building, to installing. + + +2.1. Patching The Kernel +------------------------ + + If you do not have or cannot get a copy of the kernel with the eql + driver folded into it, get your copy of the driver from + ftp://slaughter.ncm.com/pub/Linux/LOAD_BALANCING/eql-1.1.tar.gz. + Unpack this archive someplace obvious like /usr/local/src/. It will + create the following files:: + + -rw-r--r-- guru/ncm 198 Jan 19 18:53 1995 eql-1.1/NO-WARRANTY + -rw-r--r-- guru/ncm 30620 Feb 27 21:40 1995 eql-1.1/eql-1.1.patch + -rwxr-xr-x guru/ncm 16111 Jan 12 22:29 1995 eql-1.1/eql_enslave + -rw-r--r-- guru/ncm 2195 Jan 10 21:48 1995 eql-1.1/eql_enslave.c + + Unpack a recent kernel (something after 1.1.92) someplace convenient + like say /usr/src/linux-1.1.92.eql. Use symbolic links to point + /usr/src/linux to this development directory. + + + Apply the patch by running the commands:: + + cd /usr/src + patch + ". Here are some example enslavings:: + + eql_enslave eql sl0 28800 + eql_enslave eql ppp0 14400 + eql_enslave eql sl1 57600 + + When you want to free a device from its life of slavery, you can + either down the device with ifconfig (eql will automatically bury the + dead slave and remove it from its queue) or use eql_emancipate to free + it. (-- Or just ifconfig it down, and the eql driver will take it out + for you.--):: + + eql_emancipate eql sl0 + eql_emancipate eql ppp0 + eql_emancipate eql sl1 + + +3.3. DSLIP Configuration for the eql Device +------------------------------------------- + + The general idea is to bring up and keep up as many SLIP connections + as you need, automatically. + + +3.3.1. /etc/slip/runslip.conf +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + Here is an example runslip.conf:: + + name sl-line-1 + enabled + baud 38400 + mtu 576 + ducmd -e /etc/slip/dialout/cua2-288.xp -t 9 + command eql_enslave eql $interface 28800 + address 198.67.33.239 + line /dev/cua2 + + name sl-line-2 + enabled + baud 38400 + mtu 576 + ducmd -e /etc/slip/dialout/cua3-288.xp -t 9 + command eql_enslave eql $interface 28800 + address 198.67.33.239 + line /dev/cua3 + + +3.4. Using PPP and the eql Device +--------------------------------- + + I have not yet done any load-balancing testing for PPP devices, mainly + because I don't have a PPP-connection manager like SLIP has with + DSLIP. I did find a good tip from LinuxNET:Billy for PPP performance: + make sure you have asyncmap set to something so that control + characters are not escaped. + + + I tried to fix up a PPP script/system for redialing lost PPP + connections for use with the eql driver the weekend of Feb 25-26 '95 + (Hereafter known as the 8-hour PPP Hate Festival). Perhaps later this + year. + + +4. About the Slave Scheduler Algorithm +====================================== + + The slave scheduler probably could be replaced with a dozen other + things and push traffic much faster. The formula in the current set + up of the driver was tuned to handle slaves with wildly different + bits-per-second "priorities". + + + All testing I have done was with two 28.8 V.FC modems, one connecting + at 28800 bps or slower, and the other connecting at 14400 bps all the + time. + + + One version of the scheduler was able to push 5.3 K/s through the + 28800 and 14400 connections, but when the priorities on the links were + very wide apart (57600 vs. 14400) the "faster" modem received all + traffic and the "slower" modem starved. + + +5. Testers' Reports +=================== + + Some people have experimented with the eql device with newer + kernels (than 1.1.75). I have since updated the driver to patch + cleanly in newer kernels because of the removal of the old "slave- + balancing" driver config option. + + + - icee from LinuxNET patched 1.1.86 without any rejects and was able + to boot the kernel and enslave a couple of ISDN PPP links. + +5.1. Randolph Bentson's Test Report +----------------------------------- + + :: + + From bentson@grieg.seaslug.org Wed Feb 8 19:08:09 1995 + Date: Tue, 7 Feb 95 22:57 PST + From: Randolph Bentson + To: guru@ncm.com + Subject: EQL driver tests + + + I have been checking out your eql driver. (Nice work, that!) + Although you may already done this performance testing, here + are some data I've discovered. + + Randolph Bentson + bentson@grieg.seaslug.org + +------------------------------------------------------------------ + + + A pseudo-device driver, EQL, written by Simon Janes, can be used + to bundle multiple SLIP connections into what appears to be a + single connection. This allows one to improve dial-up network + connectivity gradually, without having to buy expensive DSU/CSU + hardware and services. + + I have done some testing of this software, with two goals in + mind: first, to ensure it actually works as described and + second, as a method of exercising my device driver. + + The following performance measurements were derived from a set + of SLIP connections run between two Linux systems (1.1.84) using + a 486DX2/66 with a Cyclom-8Ys and a 486SLC/40 with a Cyclom-16Y. + (Ports 0,1,2,3 were used. A later configuration will distribute + port selection across the different Cirrus chips on the boards.) + Once a link was established, I timed a binary ftp transfer of + 289284 bytes of data. If there were no overhead (packet headers, + inter-character and inter-packet delays, etc.) the transfers + would take the following times:: + + bits/sec seconds + 345600 8.3 + 234600 12.3 + 172800 16.7 + 153600 18.8 + 76800 37.6 + 57600 50.2 + 38400 75.3 + 28800 100.4 + 19200 150.6 + 9600 301.3 + + A single line running at the lower speeds and with large packets + comes to within 2% of this. Performance is limited for the higher + speeds (as predicted by the Cirrus databook) to an aggregate of + about 160 kbits/sec. The next round of testing will distribute + the load across two or more Cirrus chips. + + The good news is that one gets nearly the full advantage of the + second, third, and fourth line's bandwidth. (The bad news is + that the connection establishment seemed fragile for the higher + speeds. Once established, the connection seemed robust enough.) + + ====== ======== === ======== ======= ======= === + #lines speed mtu seconds theory actual %of + kbit/sec duration speed speed max + ====== ======== === ======== ======= ======= === + 3 115200 900 _ 345600 + 3 115200 400 18.1 345600 159825 46 + 2 115200 900 _ 230400 + 2 115200 600 18.1 230400 159825 69 + 2 115200 400 19.3 230400 149888 65 + 4 57600 900 _ 234600 + 4 57600 600 _ 234600 + 4 57600 400 _ 234600 + 3 57600 600 20.9 172800 138413 80 + 3 57600 900 21.2 172800 136455 78 + 3 115200 600 21.7 345600 133311 38 + 3 57600 400 22.5 172800 128571 74 + 4 38400 900 25.2 153600 114795 74 + 4 38400 600 26.4 153600 109577 71 + 4 38400 400 27.3 153600 105965 68 + 2 57600 900 29.1 115200 99410.3 86 + 1 115200 900 30.7 115200 94229.3 81 + 2 57600 600 30.2 115200 95789.4 83 + 3 38400 900 30.3 115200 95473.3 82 + 3 38400 600 31.2 115200 92719.2 80 + 1 115200 600 31.3 115200 92423 80 + 2 57600 400 32.3 115200 89561.6 77 + 1 115200 400 32.8 115200 88196.3 76 + 3 38400 400 33.5 115200 86353.4 74 + 2 38400 900 43.7 76800 66197.7 86 + 2 38400 600 44 76800 65746.4 85 + 2 38400 400 47.2 76800 61289 79 + 4 19200 900 50.8 76800 56945.7 74 + 4 19200 400 53.2 76800 54376.7 70 + 4 19200 600 53.7 76800 53870.4 70 + 1 57600 900 54.6 57600 52982.4 91 + 1 57600 600 56.2 57600 51474 89 + 3 19200 900 60.5 57600 47815.5 83 + 1 57600 400 60.2 57600 48053.8 83 + 3 19200 600 62 57600 46658.7 81 + 3 19200 400 64.7 57600 44711.6 77 + 1 38400 900 79.4 38400 36433.8 94 + 1 38400 600 82.4 38400 35107.3 91 + 2 19200 900 84.4 38400 34275.4 89 + 1 38400 400 86.8 38400 33327.6 86 + 2 19200 600 87.6 38400 33023.3 85 + 2 19200 400 91.2 38400 31719.7 82 + 4 9600 900 94.7 38400 30547.4 79 + 4 9600 400 106 38400 27290.9 71 + 4 9600 600 110 38400 26298.5 68 + 3 9600 900 118 28800 24515.6 85 + 3 9600 600 120 28800 24107 83 + 3 9600 400 131 28800 22082.7 76 + 1 19200 900 155 19200 18663.5 97 + 1 19200 600 161 19200 17968 93 + 1 19200 400 170 19200 17016.7 88 + 2 9600 600 176 19200 16436.6 85 + 2 9600 900 180 19200 16071.3 83 + 2 9600 400 181 19200 15982.5 83 + 1 9600 900 305 9600 9484.72 98 + 1 9600 600 314 9600 9212.87 95 + 1 9600 400 332 9600 8713.37 90 + ====== ======== === ======== ======= ======= === + +5.2. Anthony Healy's Report +--------------------------- + + :: + + Date: Mon, 13 Feb 1995 16:17:29 +1100 (EST) + From: Antony Healey + To: Simon Janes + Subject: Re: Load Balancing + + Hi Simon, + I've installed your patch and it works great. I have trialed + it over twin SL/IP lines, just over null modems, but I was + able to data at over 48Kb/s [ISDN link -Simon]. I managed a + transfer of up to 7.5 Kbyte/s on one go, but averaged around + 6.4 Kbyte/s, which I think is pretty cool. :) diff --git a/Documentation/networking/eql.txt b/Documentation/networking/eql.txt deleted file mode 100644 index 0f1550150f05..000000000000 --- a/Documentation/networking/eql.txt +++ /dev/null @@ -1,528 +0,0 @@ - EQL Driver: Serial IP Load Balancing HOWTO - Simon "Guru Aleph-Null" Janes, simon@ncm.com - v1.1, February 27, 1995 - - This is the manual for the EQL device driver. EQL is a software device - that lets you load-balance IP serial links (SLIP or uncompressed PPP) - to increase your bandwidth. It will not reduce your latency (i.e. ping - times) except in the case where you already have lots of traffic on - your link, in which it will help them out. This driver has been tested - with the 1.1.75 kernel, and is known to have patched cleanly with - 1.1.86. Some testing with 1.1.92 has been done with the v1.1 patch - which was only created to patch cleanly in the very latest kernel - source trees. (Yes, it worked fine.) - - 1. Introduction - - Which is worse? A huge fee for a 56K leased line or two phone lines? - It's probably the former. If you find yourself craving more bandwidth, - and have a ISP that is flexible, it is now possible to bind modems - together to work as one point-to-point link to increase your - bandwidth. All without having to have a special black box on either - side. - - - The eql driver has only been tested with the Livingston PortMaster-2e - terminal server. I do not know if other terminal servers support load- - balancing, but I do know that the PortMaster does it, and does it - almost as well as the eql driver seems to do it (-- Unfortunately, in - my testing so far, the Livingston PortMaster 2e's load-balancing is a - good 1 to 2 KB/s slower than the test machine working with a 28.8 Kbps - and 14.4 Kbps connection. However, I am not sure that it really is - the PortMaster, or if it's Linux's TCP drivers. I'm told that Linux's - TCP implementation is pretty fast though.--) - - - I suggest to ISPs out there that it would probably be fair to charge - a load-balancing client 75% of the cost of the second line and 50% of - the cost of the third line etc... - - - Hey, we can all dream you know... - - - 2. Kernel Configuration - - Here I describe the general steps of getting a kernel up and working - with the eql driver. From patching, building, to installing. - - - 2.1. Patching The Kernel - - If you do not have or cannot get a copy of the kernel with the eql - driver folded into it, get your copy of the driver from - ftp://slaughter.ncm.com/pub/Linux/LOAD_BALANCING/eql-1.1.tar.gz. - Unpack this archive someplace obvious like /usr/local/src/. It will - create the following files: - - - - ______________________________________________________________________ - -rw-r--r-- guru/ncm 198 Jan 19 18:53 1995 eql-1.1/NO-WARRANTY - -rw-r--r-- guru/ncm 30620 Feb 27 21:40 1995 eql-1.1/eql-1.1.patch - -rwxr-xr-x guru/ncm 16111 Jan 12 22:29 1995 eql-1.1/eql_enslave - -rw-r--r-- guru/ncm 2195 Jan 10 21:48 1995 eql-1.1/eql_enslave.c - ______________________________________________________________________ - - Unpack a recent kernel (something after 1.1.92) someplace convenient - like say /usr/src/linux-1.1.92.eql. Use symbolic links to point - /usr/src/linux to this development directory. - - - Apply the patch by running the commands: - - - ______________________________________________________________________ - cd /usr/src - patch - ". Here are some example enslavings: - - - - ______________________________________________________________________ - eql_enslave eql sl0 28800 - eql_enslave eql ppp0 14400 - eql_enslave eql sl1 57600 - ______________________________________________________________________ - - - - - - When you want to free a device from its life of slavery, you can - either down the device with ifconfig (eql will automatically bury the - dead slave and remove it from its queue) or use eql_emancipate to free - it. (-- Or just ifconfig it down, and the eql driver will take it out - for you.--) - - - - ______________________________________________________________________ - eql_emancipate eql sl0 - eql_emancipate eql ppp0 - eql_emancipate eql sl1 - ______________________________________________________________________ - - - - - - 3.3. DSLIP Configuration for the eql Device - - The general idea is to bring up and keep up as many SLIP connections - as you need, automatically. - - - 3.3.1. /etc/slip/runslip.conf - - Here is an example runslip.conf: - - - - - - - - - - - - - - - - ______________________________________________________________________ - name sl-line-1 - enabled - baud 38400 - mtu 576 - ducmd -e /etc/slip/dialout/cua2-288.xp -t 9 - command eql_enslave eql $interface 28800 - address 198.67.33.239 - line /dev/cua2 - - name sl-line-2 - enabled - baud 38400 - mtu 576 - ducmd -e /etc/slip/dialout/cua3-288.xp -t 9 - command eql_enslave eql $interface 28800 - address 198.67.33.239 - line /dev/cua3 - ______________________________________________________________________ - - - - - - 3.4. Using PPP and the eql Device - - I have not yet done any load-balancing testing for PPP devices, mainly - because I don't have a PPP-connection manager like SLIP has with - DSLIP. I did find a good tip from LinuxNET:Billy for PPP performance: - make sure you have asyncmap set to something so that control - characters are not escaped. - - - I tried to fix up a PPP script/system for redialing lost PPP - connections for use with the eql driver the weekend of Feb 25-26 '95 - (Hereafter known as the 8-hour PPP Hate Festival). Perhaps later this - year. - - - 4. About the Slave Scheduler Algorithm - - The slave scheduler probably could be replaced with a dozen other - things and push traffic much faster. The formula in the current set - up of the driver was tuned to handle slaves with wildly different - bits-per-second "priorities". - - - All testing I have done was with two 28.8 V.FC modems, one connecting - at 28800 bps or slower, and the other connecting at 14400 bps all the - time. - - - One version of the scheduler was able to push 5.3 K/s through the - 28800 and 14400 connections, but when the priorities on the links were - very wide apart (57600 vs. 14400) the "faster" modem received all - traffic and the "slower" modem starved. - - - 5. Testers' Reports - - Some people have experimented with the eql device with newer - kernels (than 1.1.75). I have since updated the driver to patch - cleanly in newer kernels because of the removal of the old "slave- - balancing" driver config option. - - - o icee from LinuxNET patched 1.1.86 without any rejects and was able - to boot the kernel and enslave a couple of ISDN PPP links. - - 5.1. Randolph Bentson's Test Report - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - From bentson@grieg.seaslug.org Wed Feb 8 19:08:09 1995 - Date: Tue, 7 Feb 95 22:57 PST - From: Randolph Bentson - To: guru@ncm.com - Subject: EQL driver tests - - - I have been checking out your eql driver. (Nice work, that!) - Although you may already done this performance testing, here - are some data I've discovered. - - Randolph Bentson - bentson@grieg.seaslug.org - - --------------------------------------------------------- - - - A pseudo-device driver, EQL, written by Simon Janes, can be used - to bundle multiple SLIP connections into what appears to be a - single connection. This allows one to improve dial-up network - connectivity gradually, without having to buy expensive DSU/CSU - hardware and services. - - I have done some testing of this software, with two goals in - mind: first, to ensure it actually works as described and - second, as a method of exercising my device driver. - - The following performance measurements were derived from a set - of SLIP connections run between two Linux systems (1.1.84) using - a 486DX2/66 with a Cyclom-8Ys and a 486SLC/40 with a Cyclom-16Y. - (Ports 0,1,2,3 were used. A later configuration will distribute - port selection across the different Cirrus chips on the boards.) - Once a link was established, I timed a binary ftp transfer of - 289284 bytes of data. If there were no overhead (packet headers, - inter-character and inter-packet delays, etc.) the transfers - would take the following times: - - bits/sec seconds - 345600 8.3 - 234600 12.3 - 172800 16.7 - 153600 18.8 - 76800 37.6 - 57600 50.2 - 38400 75.3 - 28800 100.4 - 19200 150.6 - 9600 301.3 - - A single line running at the lower speeds and with large packets - comes to within 2% of this. Performance is limited for the higher - speeds (as predicted by the Cirrus databook) to an aggregate of - about 160 kbits/sec. The next round of testing will distribute - the load across two or more Cirrus chips. - - The good news is that one gets nearly the full advantage of the - second, third, and fourth line's bandwidth. (The bad news is - that the connection establishment seemed fragile for the higher - speeds. Once established, the connection seemed robust enough.) - - #lines speed mtu seconds theory actual %of - kbit/sec duration speed speed max - 3 115200 900 _ 345600 - 3 115200 400 18.1 345600 159825 46 - 2 115200 900 _ 230400 - 2 115200 600 18.1 230400 159825 69 - 2 115200 400 19.3 230400 149888 65 - 4 57600 900 _ 234600 - 4 57600 600 _ 234600 - 4 57600 400 _ 234600 - 3 57600 600 20.9 172800 138413 80 - 3 57600 900 21.2 172800 136455 78 - 3 115200 600 21.7 345600 133311 38 - 3 57600 400 22.5 172800 128571 74 - 4 38400 900 25.2 153600 114795 74 - 4 38400 600 26.4 153600 109577 71 - 4 38400 400 27.3 153600 105965 68 - 2 57600 900 29.1 115200 99410.3 86 - 1 115200 900 30.7 115200 94229.3 81 - 2 57600 600 30.2 115200 95789.4 83 - 3 38400 900 30.3 115200 95473.3 82 - 3 38400 600 31.2 115200 92719.2 80 - 1 115200 600 31.3 115200 92423 80 - 2 57600 400 32.3 115200 89561.6 77 - 1 115200 400 32.8 115200 88196.3 76 - 3 38400 400 33.5 115200 86353.4 74 - 2 38400 900 43.7 76800 66197.7 86 - 2 38400 600 44 76800 65746.4 85 - 2 38400 400 47.2 76800 61289 79 - 4 19200 900 50.8 76800 56945.7 74 - 4 19200 400 53.2 76800 54376.7 70 - 4 19200 600 53.7 76800 53870.4 70 - 1 57600 900 54.6 57600 52982.4 91 - 1 57600 600 56.2 57600 51474 89 - 3 19200 900 60.5 57600 47815.5 83 - 1 57600 400 60.2 57600 48053.8 83 - 3 19200 600 62 57600 46658.7 81 - 3 19200 400 64.7 57600 44711.6 77 - 1 38400 900 79.4 38400 36433.8 94 - 1 38400 600 82.4 38400 35107.3 91 - 2 19200 900 84.4 38400 34275.4 89 - 1 38400 400 86.8 38400 33327.6 86 - 2 19200 600 87.6 38400 33023.3 85 - 2 19200 400 91.2 38400 31719.7 82 - 4 9600 900 94.7 38400 30547.4 79 - 4 9600 400 106 38400 27290.9 71 - 4 9600 600 110 38400 26298.5 68 - 3 9600 900 118 28800 24515.6 85 - 3 9600 600 120 28800 24107 83 - 3 9600 400 131 28800 22082.7 76 - 1 19200 900 155 19200 18663.5 97 - 1 19200 600 161 19200 17968 93 - 1 19200 400 170 19200 17016.7 88 - 2 9600 600 176 19200 16436.6 85 - 2 9600 900 180 19200 16071.3 83 - 2 9600 400 181 19200 15982.5 83 - 1 9600 900 305 9600 9484.72 98 - 1 9600 600 314 9600 9212.87 95 - 1 9600 400 332 9600 8713.37 90 - - - - - - 5.2. Anthony Healy's Report - - - - - - - - Date: Mon, 13 Feb 1995 16:17:29 +1100 (EST) - From: Antony Healey - To: Simon Janes - Subject: Re: Load Balancing - - Hi Simon, - I've installed your patch and it works great. I have trialed - it over twin SL/IP lines, just over null modems, but I was - able to data at over 48Kb/s [ISDN link -Simon]. I managed a - transfer of up to 7.5 Kbyte/s on one go, but averaged around - 6.4 Kbyte/s, which I think is pretty cool. :) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 313f66900bce..9ef6ef42bdc5 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -54,6 +54,7 @@ Contents: defza dns_resolver driver + eql .. only:: subproject and html diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 4ab6d343fd86..c822f4a6d166 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -126,7 +126,7 @@ config EQUALIZER Linux driver or with a Livingston Portmaster 2e. Say Y if you want this and read - . You may also want to read + . You may also want to read section 6.2 of the NET-3-HOWTO, available from . -- cgit v1.2.3 From aee113427c5d205730b2c1a023661799f41aca23 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:35 +0200 Subject: docs: networking: convert fib_trie.txt to ReST - add SPDX header; - adjust title markup; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/fib_trie.rst | 149 ++++++++++++++++++++++++++++++++++ Documentation/networking/fib_trie.txt | 145 --------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 150 insertions(+), 145 deletions(-) create mode 100644 Documentation/networking/fib_trie.rst delete mode 100644 Documentation/networking/fib_trie.txt (limited to 'Documentation') diff --git a/Documentation/networking/fib_trie.rst b/Documentation/networking/fib_trie.rst new file mode 100644 index 000000000000..f1435b7fcdb7 --- /dev/null +++ b/Documentation/networking/fib_trie.rst @@ -0,0 +1,149 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +LC-trie implementation notes +============================ + +Node types +---------- +leaf + An end node with data. This has a copy of the relevant key, along + with 'hlist' with routing table entries sorted by prefix length. + See struct leaf and struct leaf_info. + +trie node or tnode + An internal node, holding an array of child (leaf or tnode) pointers, + indexed through a subset of the key. See Level Compression. + +A few concepts explained +------------------------ +Bits (tnode) + The number of bits in the key segment used for indexing into the + child array - the "child index". See Level Compression. + +Pos (tnode) + The position (in the key) of the key segment used for indexing into + the child array. See Path Compression. + +Path Compression / skipped bits + Any given tnode is linked to from the child array of its parent, using + a segment of the key specified by the parent's "pos" and "bits" + In certain cases, this tnode's own "pos" will not be immediately + adjacent to the parent (pos+bits), but there will be some bits + in the key skipped over because they represent a single path with no + deviations. These "skipped bits" constitute Path Compression. + Note that the search algorithm will simply skip over these bits when + searching, making it necessary to save the keys in the leaves to + verify that they actually do match the key we are searching for. + +Level Compression / child arrays + the trie is kept level balanced moving, under certain conditions, the + children of a full child (see "full_children") up one level, so that + instead of a pure binary tree, each internal node ("tnode") may + contain an arbitrarily large array of links to several children. + Conversely, a tnode with a mostly empty child array (see empty_children) + may be "halved", having some of its children moved downwards one level, + in order to avoid ever-increasing child arrays. + +empty_children + the number of positions in the child array of a given tnode that are + NULL. + +full_children + the number of children of a given tnode that aren't path compressed. + (in other words, they aren't NULL or leaves and their "pos" is equal + to this tnode's "pos"+"bits"). + + (The word "full" here is used more in the sense of "complete" than + as the opposite of "empty", which might be a tad confusing.) + +Comments +--------- + +We have tried to keep the structure of the code as close to fib_hash as +possible to allow verification and help up reviewing. + +fib_find_node() + A good start for understanding this code. This function implements a + straightforward trie lookup. + +fib_insert_node() + Inserts a new leaf node in the trie. This is bit more complicated than + fib_find_node(). Inserting a new node means we might have to run the + level compression algorithm on part of the trie. + +trie_leaf_remove() + Looks up a key, deletes it and runs the level compression algorithm. + +trie_rebalance() + The key function for the dynamic trie after any change in the trie + it is run to optimize and reorganize. It will walk the trie upwards + towards the root from a given tnode, doing a resize() at each step + to implement level compression. + +resize() + Analyzes a tnode and optimizes the child array size by either inflating + or shrinking it repeatedly until it fulfills the criteria for optimal + level compression. This part follows the original paper pretty closely + and there may be some room for experimentation here. + +inflate() + Doubles the size of the child array within a tnode. Used by resize(). + +halve() + Halves the size of the child array within a tnode - the inverse of + inflate(). Used by resize(); + +fn_trie_insert(), fn_trie_delete(), fn_trie_select_default() + The route manipulation functions. Should conform pretty closely to the + corresponding functions in fib_hash. + +fn_trie_flush() + This walks the full trie (using nextleaf()) and searches for empty + leaves which have to be removed. + +fn_trie_dump() + Dumps the routing table ordered by prefix length. This is somewhat + slower than the corresponding fib_hash function, as we have to walk the + entire trie for each prefix length. In comparison, fib_hash is organized + as one "zone"/hash per prefix length. + +Locking +------- + +fib_lock is used for an RW-lock in the same way that this is done in fib_hash. +However, the functions are somewhat separated for other possible locking +scenarios. It might conceivably be possible to run trie_rebalance via RCU +to avoid read_lock in the fn_trie_lookup() function. + +Main lookup mechanism +--------------------- +fn_trie_lookup() is the main lookup function. + +The lookup is in its simplest form just like fib_find_node(). We descend the +trie, key segment by key segment, until we find a leaf. check_leaf() does +the fib_semantic_match in the leaf's sorted prefix hlist. + +If we find a match, we are done. + +If we don't find a match, we enter prefix matching mode. The prefix length, +starting out at the same as the key length, is reduced one step at a time, +and we backtrack upwards through the trie trying to find a longest matching +prefix. The goal is always to reach a leaf and get a positive result from the +fib_semantic_match mechanism. + +Inside each tnode, the search for longest matching prefix consists of searching +through the child array, chopping off (zeroing) the least significant "1" of +the child index until we find a match or the child index consists of nothing but +zeros. + +At this point we backtrack (t->stats.backtrack++) up the trie, continuing to +chop off part of the key in order to find the longest matching prefix. + +At this point we will repeatedly descend subtries to look for a match, and there +are some optimizations available that can provide us with "shortcuts" to avoid +descending into dead ends. Look for "HL_OPTIMIZE" sections in the code. + +To alleviate any doubts about the correctness of the route selection process, +a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which +gives userland access to fib_lookup(). diff --git a/Documentation/networking/fib_trie.txt b/Documentation/networking/fib_trie.txt deleted file mode 100644 index fe719388518b..000000000000 --- a/Documentation/networking/fib_trie.txt +++ /dev/null @@ -1,145 +0,0 @@ - LC-trie implementation notes. - -Node types ----------- -leaf - An end node with data. This has a copy of the relevant key, along - with 'hlist' with routing table entries sorted by prefix length. - See struct leaf and struct leaf_info. - -trie node or tnode - An internal node, holding an array of child (leaf or tnode) pointers, - indexed through a subset of the key. See Level Compression. - -A few concepts explained ------------------------- -Bits (tnode) - The number of bits in the key segment used for indexing into the - child array - the "child index". See Level Compression. - -Pos (tnode) - The position (in the key) of the key segment used for indexing into - the child array. See Path Compression. - -Path Compression / skipped bits - Any given tnode is linked to from the child array of its parent, using - a segment of the key specified by the parent's "pos" and "bits" - In certain cases, this tnode's own "pos" will not be immediately - adjacent to the parent (pos+bits), but there will be some bits - in the key skipped over because they represent a single path with no - deviations. These "skipped bits" constitute Path Compression. - Note that the search algorithm will simply skip over these bits when - searching, making it necessary to save the keys in the leaves to - verify that they actually do match the key we are searching for. - -Level Compression / child arrays - the trie is kept level balanced moving, under certain conditions, the - children of a full child (see "full_children") up one level, so that - instead of a pure binary tree, each internal node ("tnode") may - contain an arbitrarily large array of links to several children. - Conversely, a tnode with a mostly empty child array (see empty_children) - may be "halved", having some of its children moved downwards one level, - in order to avoid ever-increasing child arrays. - -empty_children - the number of positions in the child array of a given tnode that are - NULL. - -full_children - the number of children of a given tnode that aren't path compressed. - (in other words, they aren't NULL or leaves and their "pos" is equal - to this tnode's "pos"+"bits"). - - (The word "full" here is used more in the sense of "complete" than - as the opposite of "empty", which might be a tad confusing.) - -Comments ---------- - -We have tried to keep the structure of the code as close to fib_hash as -possible to allow verification and help up reviewing. - -fib_find_node() - A good start for understanding this code. This function implements a - straightforward trie lookup. - -fib_insert_node() - Inserts a new leaf node in the trie. This is bit more complicated than - fib_find_node(). Inserting a new node means we might have to run the - level compression algorithm on part of the trie. - -trie_leaf_remove() - Looks up a key, deletes it and runs the level compression algorithm. - -trie_rebalance() - The key function for the dynamic trie after any change in the trie - it is run to optimize and reorganize. It will walk the trie upwards - towards the root from a given tnode, doing a resize() at each step - to implement level compression. - -resize() - Analyzes a tnode and optimizes the child array size by either inflating - or shrinking it repeatedly until it fulfills the criteria for optimal - level compression. This part follows the original paper pretty closely - and there may be some room for experimentation here. - -inflate() - Doubles the size of the child array within a tnode. Used by resize(). - -halve() - Halves the size of the child array within a tnode - the inverse of - inflate(). Used by resize(); - -fn_trie_insert(), fn_trie_delete(), fn_trie_select_default() - The route manipulation functions. Should conform pretty closely to the - corresponding functions in fib_hash. - -fn_trie_flush() - This walks the full trie (using nextleaf()) and searches for empty - leaves which have to be removed. - -fn_trie_dump() - Dumps the routing table ordered by prefix length. This is somewhat - slower than the corresponding fib_hash function, as we have to walk the - entire trie for each prefix length. In comparison, fib_hash is organized - as one "zone"/hash per prefix length. - -Locking -------- - -fib_lock is used for an RW-lock in the same way that this is done in fib_hash. -However, the functions are somewhat separated for other possible locking -scenarios. It might conceivably be possible to run trie_rebalance via RCU -to avoid read_lock in the fn_trie_lookup() function. - -Main lookup mechanism ---------------------- -fn_trie_lookup() is the main lookup function. - -The lookup is in its simplest form just like fib_find_node(). We descend the -trie, key segment by key segment, until we find a leaf. check_leaf() does -the fib_semantic_match in the leaf's sorted prefix hlist. - -If we find a match, we are done. - -If we don't find a match, we enter prefix matching mode. The prefix length, -starting out at the same as the key length, is reduced one step at a time, -and we backtrack upwards through the trie trying to find a longest matching -prefix. The goal is always to reach a leaf and get a positive result from the -fib_semantic_match mechanism. - -Inside each tnode, the search for longest matching prefix consists of searching -through the child array, chopping off (zeroing) the least significant "1" of -the child index until we find a match or the child index consists of nothing but -zeros. - -At this point we backtrack (t->stats.backtrack++) up the trie, continuing to -chop off part of the key in order to find the longest matching prefix. - -At this point we will repeatedly descend subtries to look for a match, and there -are some optimizations available that can provide us with "shortcuts" to avoid -descending into dead ends. Look for "HL_OPTIMIZE" sections in the code. - -To alleviate any doubts about the correctness of the route selection process, -a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which -gives userland access to fib_lookup(). diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 9ef6ef42bdc5..807abe25ae4b 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -55,6 +55,7 @@ Contents: dns_resolver driver eql + fib_trie .. only:: subproject and html -- cgit v1.2.3 From cb3f0d56e153398a035eb22769d2cb2837f29747 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:36 +0200 Subject: docs: networking: convert filter.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - use footnote markup; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/bpf/index.rst | 4 +- Documentation/networking/filter.rst | 1651 ++++++++++++++++++++++++++++++ Documentation/networking/filter.txt | 1545 ---------------------------- Documentation/networking/index.rst | 1 + Documentation/networking/packet_mmap.txt | 2 +- MAINTAINERS | 2 +- tools/bpf/bpf_asm.c | 2 +- tools/bpf/bpf_dbg.c | 2 +- 8 files changed, 1658 insertions(+), 1551 deletions(-) create mode 100644 Documentation/networking/filter.rst delete mode 100644 Documentation/networking/filter.txt (limited to 'Documentation') diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index f99677f3572f..38b4db8be7a2 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -7,7 +7,7 @@ Filter) facility, with a focus on the extended BPF version (eBPF). This kernel side documentation is still work in progress. The main textual documentation is (for historical reasons) described in -`Documentation/networking/filter.txt`_, which describe both classical +`Documentation/networking/filter.rst`_, which describe both classical and extended BPF instruction-set. The Cilium project also maintains a `BPF and XDP Reference Guide`_ that goes into great technical depth about the BPF Architecture. @@ -59,7 +59,7 @@ Testing and debugging BPF .. Links: -.. _Documentation/networking/filter.txt: ../networking/filter.txt +.. _Documentation/networking/filter.rst: ../networking/filter.txt .. _man-pages: https://www.kernel.org/doc/man-pages/ .. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html .. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/ diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst new file mode 100644 index 000000000000..a1d3e192b9fa --- /dev/null +++ b/Documentation/networking/filter.rst @@ -0,0 +1,1651 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================================= +Linux Socket Filtering aka Berkeley Packet Filter (BPF) +======================================================= + +Introduction +------------ + +Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. +Though there are some distinct differences between the BSD and Linux +Kernel filtering, but when we speak of BPF or LSF in Linux context, we +mean the very same mechanism of filtering in the Linux kernel. + +BPF allows a user-space program to attach a filter onto any socket and +allow or disallow certain types of data to come through the socket. LSF +follows exactly the same filter code structure as BSD's BPF, so referring +to the BSD bpf.4 manpage is very helpful in creating filters. + +On Linux, BPF is much simpler than on BSD. One does not have to worry +about devices or anything like that. You simply create your filter code, +send it to the kernel via the SO_ATTACH_FILTER option and if your filter +code passes the kernel check on it, you then immediately begin filtering +data on that socket. + +You can also detach filters from your socket via the SO_DETACH_FILTER +option. This will probably not be used much since when you close a socket +that has a filter on it the filter is automagically removed. The other +less common case may be adding a different filter on the same socket where +you had another filter that is still running: the kernel takes care of +removing the old one and placing your new one in its place, assuming your +filter has passed the checks, otherwise if it fails the old filter will +remain on that socket. + +SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once +set, a filter cannot be removed or changed. This allows one process to +setup a socket, attach a filter, lock it then drop privileges and be +assured that the filter will be kept until the socket is closed. + +The biggest user of this construct might be libpcap. Issuing a high-level +filter command like `tcpdump -i em1 port 22` passes through the libpcap +internal compiler that generates a structure that can eventually be loaded +via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` +displays what is being placed into this structure. + +Although we were only speaking about sockets here, BPF in Linux is used +in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel +qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places +such as team driver, PTP code, etc where BPF is being used. + +.. [1] Documentation/userspace-api/seccomp_filter.rst + +Original BPF paper: + +Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new +architecture for user-level packet capture. In Proceedings of the +USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 +Conference Proceedings (USENIX'93). USENIX Association, Berkeley, +CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf] + +Structure +--------- + +User space applications include which contains the +following relevant structures:: + + struct sock_filter { /* Filter block */ + __u16 code; /* Actual filter code */ + __u8 jt; /* Jump true */ + __u8 jf; /* Jump false */ + __u32 k; /* Generic multiuse field */ + }; + +Such a structure is assembled as an array of 4-tuples, that contains +a code, jt, jf and k value. jt and jf are jump offsets and k a generic +value to be used for a provided code:: + + struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ + unsigned short len; /* Number of filter blocks */ + struct sock_filter __user *filter; + }; + +For socket filtering, a pointer to this structure (as shown in +follow-up example) is being passed to the kernel through setsockopt(2). + +Example +------- + +:: + + #include + #include + #include + #include + /* ... */ + + /* From the example above: tcpdump -i em1 port 22 -dd */ + struct sock_filter code[] = { + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 8, 0x000086dd }, + { 0x30, 0, 0, 0x00000014 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 17, 0x00000011 }, + { 0x28, 0, 0, 0x00000036 }, + { 0x15, 14, 0, 0x00000016 }, + { 0x28, 0, 0, 0x00000038 }, + { 0x15, 12, 13, 0x00000016 }, + { 0x15, 0, 12, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 8, 0x00000011 }, + { 0x28, 0, 0, 0x00000014 }, + { 0x45, 6, 0, 0x00001fff }, + { 0xb1, 0, 0, 0x0000000e }, + { 0x48, 0, 0, 0x0000000e }, + { 0x15, 2, 0, 0x00000016 }, + { 0x48, 0, 0, 0x00000010 }, + { 0x15, 0, 1, 0x00000016 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0x00000000 }, + }; + + struct sock_fprog bpf = { + .len = ARRAY_SIZE(code), + .filter = code, + }; + + sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (sock < 0) + /* ... bail out ... */ + + ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); + if (ret < 0) + /* ... bail out ... */ + + /* ... */ + close(sock); + +The above example code attaches a socket filter for a PF_PACKET socket +in order to let all IPv4/IPv6 packets with port 22 pass. The rest will +be dropped for this socket. + +The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments +and SO_LOCK_FILTER for preventing the filter to be detached, takes an +integer value with 0 or 1. + +Note that socket filters are not restricted to PF_PACKET sockets only, +but can also be used on other socket families. + +Summary of system calls: + + * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); + +Normally, most use cases for socket filtering on packet sockets will be +covered by libpcap in high-level syntax, so as an application developer +you should stick to that. libpcap wraps its own layer around all that. + +Unless i) using/linking to libpcap is not an option, ii) the required BPF +filters use Linux extensions that are not supported by libpcap's compiler, +iii) a filter might be more complex and not cleanly implementable with +libpcap's compiler, or iv) particular filter codes should be optimized +differently than libpcap's internal compiler does; then in such cases +writing such a filter "by hand" can be of an alternative. For example, +xt_bpf and cls_bpf users might have requirements that could result in +more complex filter code, or one that cannot be expressed with libpcap +(e.g. different return codes for various code paths). Moreover, BPF JIT +implementors may wish to manually write test cases and thus need low-level +access to BPF code as well. + +BPF engine and instruction set +------------------------------ + +Under tools/bpf/ there's a small helper tool called bpf_asm which can +be used to write low-level filters for example scenarios mentioned in the +previous section. Asm-like syntax mentioned here has been implemented in +bpf_asm and will be used for further explanations (instead of dealing with +less readable opcodes directly, principles are the same). The syntax is +closely modelled after Steven McCanne's and Van Jacobson's BPF paper. + +The BPF architecture consists of the following basic elements: + + ======= ==================================================== + Element Description + ======= ==================================================== + A 32 bit wide accumulator + X 32 bit wide X register + M[] 16 x 32 bit wide misc registers aka "scratch memory + store", addressable from 0 to 15 + ======= ==================================================== + +A program, that is translated by bpf_asm into "opcodes" is an array that +consists of the following elements (as already mentioned):: + + op:16, jt:8, jf:8, k:32 + +The element op is a 16 bit wide opcode that has a particular instruction +encoded. jt and jf are two 8 bit wide jump targets, one for condition +"jump if true", the other one "jump if false". Eventually, element k +contains a miscellaneous argument that can be interpreted in different +ways depending on the given instruction in op. + +The instruction set consists of load, store, branch, alu, miscellaneous +and return instructions that are also represented in bpf_asm syntax. This +table lists all bpf_asm instructions available resp. what their underlying +opcodes as defined in linux/filter.h stand for: + + =========== =================== ===================== + Instruction Addressing mode Description + =========== =================== ===================== + ld 1, 2, 3, 4, 12 Load word into A + ldi 4 Load word into A + ldh 1, 2 Load half-word into A + ldb 1, 2 Load byte into A + ldx 3, 4, 5, 12 Load word into X + ldxi 4 Load word into X + ldxb 5 Load byte into X + + st 3 Store A into M[] + stx 3 Store X into M[] + + jmp 6 Jump to label + ja 6 Jump to label + jeq 7, 8, 9, 10 Jump on A == + jneq 9, 10 Jump on A != + jne 9, 10 Jump on A != + jlt 9, 10 Jump on A < + jle 9, 10 Jump on A <= + jgt 7, 8, 9, 10 Jump on A > + jge 7, 8, 9, 10 Jump on A >= + jset 7, 8, 9, 10 Jump on A & + + add 0, 4 A + + sub 0, 4 A - + mul 0, 4 A * + div 0, 4 A / + mod 0, 4 A % + neg !A + and 0, 4 A & + or 0, 4 A | + xor 0, 4 A ^ + lsh 0, 4 A << + rsh 0, 4 A >> + + tax Copy A into X + txa Copy X into A + + ret 4, 11 Return + =========== =================== ===================== + +The next table shows addressing formats from the 2nd column: + + =============== =================== =============================================== + Addressing mode Syntax Description + =============== =================== =============================================== + 0 x/%x Register X + 1 [k] BHW at byte offset k in the packet + 2 [x + k] BHW at the offset X + k in the packet + 3 M[k] Word at offset k in M[] + 4 #k Literal value stored in k + 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet + 6 L Jump label L + 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 8 x/%x,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 9 #k,Lt Jump to Lt if predicate is true + 10 x/%x,Lt Jump to Lt if predicate is true + 11 a/%a Accumulator A + 12 extension BPF extension + =============== =================== =============================================== + +The Linux kernel also has a couple of BPF extensions that are used along +with the class of load instructions by "overloading" the k argument with +a negative offset + a particular extension offset. The result of such BPF +extensions are loaded into A. + +Possible BPF extensions are shown in the following table: + + =================================== ================================================= + Extension Description + =================================== ================================================= + len skb->len + proto skb->protocol + type skb->pkt_type + poff Payload start offset + ifidx skb->dev->ifindex + nla Netlink attribute of type X with offset A + nlan Nested Netlink attribute of type X with offset A + mark skb->mark + queue skb->queue_mapping + hatype skb->dev->type + rxhash skb->hash + cpu raw_smp_processor_id() + vlan_tci skb_vlan_tag_get(skb) + vlan_avail skb_vlan_tag_present(skb) + vlan_tpid skb->vlan_proto + rand prandom_u32() + =================================== ================================================= + +These extensions can also be prefixed with '#'. +Examples for low-level BPF: + +**ARP packets**:: + + ldh [12] + jne #0x806, drop + ret #-1 + drop: ret #0 + +**IPv4 TCP packets**:: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #6, drop + ret #-1 + drop: ret #0 + +**(Accelerated) VLAN w/ id 10**:: + + ld vlan_tci + jneq #10, drop + ret #-1 + drop: ret #0 + +**icmp random packet sampling, 1 in 4**: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #1, drop + # get a random uint32 number + ld rand + mod #4 + jneq #1, drop + ret #-1 + drop: ret #0 + +**SECCOMP filter example**:: + + ld [4] /* offsetof(struct seccomp_data, arch) */ + jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ + ld [0] /* offsetof(struct seccomp_data, nr) */ + jeq #15, good /* __NR_rt_sigreturn */ + jeq #231, good /* __NR_exit_group */ + jeq #60, good /* __NR_exit */ + jeq #0, good /* __NR_read */ + jeq #1, good /* __NR_write */ + jeq #5, good /* __NR_fstat */ + jeq #9, good /* __NR_mmap */ + jeq #14, good /* __NR_rt_sigprocmask */ + jeq #13, good /* __NR_rt_sigaction */ + jeq #35, good /* __NR_nanosleep */ + bad: ret #0 /* SECCOMP_RET_KILL_THREAD */ + good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ + +The above example code can be placed into a file (here called "foo"), and +then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf +and cls_bpf understands and can directly be loaded with. Example with above +ARP code:: + + $ ./bpf_asm foo + 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, + +In copy and paste C-like output:: + + $ ./bpf_asm -c foo + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 1, 0x00000806 }, + { 0x06, 0, 0, 0xffffffff }, + { 0x06, 0, 0, 0000000000 }, + +In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF +filters that might not be obvious at first, it's good to test filters before +attaching to a live system. For that purpose, there's a small tool called +bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows +for testing BPF filters against given pcap files, single stepping through the +BPF code on the pcap's packets and to do BPF machine register dumps. + +Starting bpf_dbg is trivial and just requires issuing:: + + # ./bpf_dbg + +In case input and output do not equal stdin/stdout, bpf_dbg takes an +alternative stdin source as a first argument, and an alternative stdout +sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. + +Other than that, a particular libreadline configuration can be set via +file "~/.bpf_dbg_init" and the command history is stored in the file +"~/.bpf_dbg_history". + +Interaction in bpf_dbg happens through a shell that also has auto-completion +support (follow-up example commands starting with '>' denote bpf_dbg shell). +The usual workflow would be to ... + +* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 + Loads a BPF filter from standard output of bpf_asm, or transformed via + e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT + debugging (next section), this command creates a temporary socket and + loads the BPF code into the kernel. Thus, this will also be useful for + JIT developers. + +* load pcap foo.pcap + + Loads standard tcpdump pcap file. + +* run [] + +bpf passes:1 fails:9 + Runs through all packets from a pcap to account how many passes and fails + the filter will generate. A limit of packets to traverse can be given. + +* disassemble:: + + l0: ldh [12] + l1: jeq #0x800, l2, l5 + l2: ldb [23] + l3: jeq #0x1, l4, l5 + l4: ret #0xffff + l5: ret #0 + + Prints out BPF code disassembly. + +* dump:: + + /* { op, jt, jf, k }, */ + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 3, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 0, 1, 0x00000001 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0000000000 }, + + Prints out C-style BPF code dump. + +* breakpoint 0:: + + breakpoint at: l0: ldh [12] + +* breakpoint 1:: + + breakpoint at: l1: jeq #0x800, l2, l5 + + ... + + Sets breakpoints at particular BPF instructions. Issuing a `run` command + will walk through the pcap file continuing from the current packet and + break when a breakpoint is being hit (another `run` will continue from + the currently active breakpoint executing next instructions): + + * run:: + + -- register dump -- + pc: [0] <-- program counter + code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction + curr: l0: ldh [12] <-- disassembly of current instruction + A: [00000000][0] <-- content of A (hex, decimal) + X: [00000000][0] <-- content of X (hex, decimal) + M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) + -- packet dump -- <-- Current packet from pcap (hex) + len: 42 + 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 + 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 + 32: 00 00 00 00 00 00 0a 3b 01 01 + (breakpoint) + > + + * breakpoint:: + + breakpoints: 0 1 + + Prints currently set breakpoints. + +* step [-, +] + + Performs single stepping through the BPF program from the current pc + offset. Thus, on each step invocation, above register dump is issued. + This can go forwards and backwards in time, a plain `step` will break + on the next BPF instruction, thus +1. (No `run` needs to be issued here.) + +* select + + Selects a given packet from the pcap file to continue from. Thus, on + the next `run` or `step`, the BPF program is being evaluated against + the user pre-selected packet. Numbering starts just as in Wireshark + with index 1. + +* quit + + Exits bpf_dbg. + +JIT compiler +------------ + +The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, +PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through +CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each +attached filter from user space or for internal kernel users if it has +been previously enabled by root:: + + echo 1 > /proc/sys/net/core/bpf_jit_enable + +For JIT developers, doing audits etc, each compile run can output the generated +opcode image into the kernel log via:: + + echo 2 > /proc/sys/net/core/bpf_jit_enable + +Example output from dmesg:: + + [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f + [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 + [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 + [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 + [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 + [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 + +When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and +setting any other value than that will return in failure. This is even the case for +setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log +is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the +generally recommended approach instead. + +In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for +generating disassembly out of the kernel log's hexdump:: + + # ./bpf_jit_disasm + 70 bytes emitted from JIT compiler (pass:3, flen:6) + ffffffffa0069c8f + : + 0: push %rbp + 1: mov %rsp,%rbp + 4: sub $0x60,%rsp + 8: mov %rbx,-0x8(%rbp) + c: mov 0x68(%rdi),%r9d + 10: sub 0x6c(%rdi),%r9d + 14: mov 0xd8(%rdi),%r8 + 1b: mov $0xc,%esi + 20: callq 0xffffffffe0ff9442 + 25: cmp $0x800,%eax + 2a: jne 0x0000000000000042 + 2c: mov $0x17,%esi + 31: callq 0xffffffffe0ff945e + 36: cmp $0x1,%eax + 39: jne 0x0000000000000042 + 3b: mov $0xffff,%eax + 40: jmp 0x0000000000000044 + 42: xor %eax,%eax + 44: leaveq + 45: retq + + Issuing option `-o` will "annotate" opcodes to resulting assembler + instructions, which can be very useful for JIT developers: + + # ./bpf_jit_disasm -o + 70 bytes emitted from JIT compiler (pass:3, flen:6) + ffffffffa0069c8f + : + 0: push %rbp + 55 + 1: mov %rsp,%rbp + 48 89 e5 + 4: sub $0x60,%rsp + 48 83 ec 60 + 8: mov %rbx,-0x8(%rbp) + 48 89 5d f8 + c: mov 0x68(%rdi),%r9d + 44 8b 4f 68 + 10: sub 0x6c(%rdi),%r9d + 44 2b 4f 6c + 14: mov 0xd8(%rdi),%r8 + 4c 8b 87 d8 00 00 00 + 1b: mov $0xc,%esi + be 0c 00 00 00 + 20: callq 0xffffffffe0ff9442 + e8 1d 94 ff e0 + 25: cmp $0x800,%eax + 3d 00 08 00 00 + 2a: jne 0x0000000000000042 + 75 16 + 2c: mov $0x17,%esi + be 17 00 00 00 + 31: callq 0xffffffffe0ff945e + e8 28 94 ff e0 + 36: cmp $0x1,%eax + 83 f8 01 + 39: jne 0x0000000000000042 + 75 07 + 3b: mov $0xffff,%eax + b8 ff ff 00 00 + 40: jmp 0x0000000000000044 + eb 02 + 42: xor %eax,%eax + 31 c0 + 44: leaveq + c9 + 45: retq + c3 + +For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful +toolchain for developing and testing the kernel's JIT compiler. + +BPF kernel internals +-------------------- +Internally, for the kernel interpreter, a different instruction set +format with similar underlying principles from BPF described in previous +paragraphs is being used. However, the instruction set format is modelled +closer to the underlying architecture to mimic native instruction sets, so +that a better performance can be achieved (more details later). This new +ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which +originates from [e]xtended BPF is not the same as BPF extensions! While +eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' +of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) + +It is designed to be JITed with one to one mapping, which can also open up +the possibility for GCC/LLVM compilers to generate optimized eBPF code through +an eBPF backend that performs almost as fast as natively compiled code. + +The new instruction set was originally designed with the possible goal in +mind to write programs in "restricted C" and compile into eBPF with a optional +GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with +minimal performance overhead over two steps, that is, C -> eBPF -> native code. + +Currently, the new format is being used for running user BPF programs, which +includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, +team driver's classifier for its load-balancing mode, netfilter's xt_bpf +extension, PTP dissector/classifier, and much more. They are all internally +converted by the kernel into the new instruction set representation and run +in the eBPF interpreter. For in-kernel handlers, this all works transparently +by using bpf_prog_create() for setting up the filter, resp. +bpf_prog_destroy() for destroying it. The macro +BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed +code to run the filter. 'filter' is a pointer to struct bpf_prog that we +got from bpf_prog_create(), and 'ctx' the given context (e.g. +skb pointer). All constraints and restrictions from bpf_check_classic() apply +before a conversion to the new layout is being done behind the scenes! + +Currently, the classic BPF format is being used for JITing on most +32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64, +sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF +instruction set. + +Some core changes of the new internal format: + +- Number of registers increase from 2 to 10: + + The old format had two registers A and X, and a hidden frame pointer. The + new layout extends this to be 10 internal registers and a read-only frame + pointer. Since 64-bit CPUs are passing arguments to functions via registers + the number of args from eBPF program to in-kernel function is restricted + to 5 and one register is used to accept return value from an in-kernel + function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ + sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved + registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. + + Therefore, eBPF calling convention is defined as: + + * R0 - return value from in-kernel function, and exit value for eBPF program + * R1 - R5 - arguments from eBPF program to in-kernel function + * R6 - R9 - callee saved registers that in-kernel function will preserve + * R10 - read-only frame pointer to access stack + + Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, + etc, and eBPF calling convention maps directly to ABIs used by the kernel on + 64-bit architectures. + + On 32-bit architectures JIT may map programs that use only 32-bit arithmetic + and may let more complex programs to be interpreted. + + R0 - R5 are scratch registers and eBPF program needs spill/fill them if + necessary across calls. Note that there is only one eBPF program (== one + eBPF main routine) and it cannot call other eBPF functions, it can only + call predefined in-kernel functions, though. + +- Register width increases from 32-bit to 64-bit: + + Still, the semantics of the original 32-bit ALU operations are preserved + via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower + subregisters that zero-extend into 64-bit if they are being written to. + That behavior maps directly to x86_64 and arm64 subregister definition, but + makes other JITs more difficult. + + 32-bit architectures run 64-bit internal BPF programs via interpreter. + Their JITs may convert BPF programs that only use 32-bit subregisters into + native instruction set and let the rest being interpreted. + + Operation is 64-bit, because on 64-bit architectures, pointers are also + 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, + so 32-bit eBPF registers would otherwise require to define register-pair + ABI, thus, there won't be able to use a direct eBPF register to HW register + mapping and JIT would need to do combine/split/move operations for every + register in and out of the function, which is complex, bug prone and slow. + Another reason is the use of atomic 64-bit counters. + +- Conditional jt/jf targets replaced with jt/fall-through: + + While the original design has constructs such as ``if (cond) jump_true; + else jump_false;``, they are being replaced into alternative constructs like + ``if (cond) jump_true; /* else fall-through */``. + +- Introduces bpf_call insn and register passing convention for zero overhead + calls from/to other kernel functions: + + Before an in-kernel function call, the internal BPF program needs to + place function arguments into R1 to R5 registers to satisfy calling + convention, then the interpreter will take them from registers and pass + to in-kernel function. If R1 - R5 registers are mapped to CPU registers + that are used for argument passing on given architecture, the JIT compiler + doesn't need to emit extra moves. Function arguments will be in the correct + registers and BPF_CALL instruction will be JITed as single 'call' HW + instruction. This calling convention was picked to cover common call + situations without performance penalty. + + After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has + a return value of the function. Since R6 - R9 are callee saved, their state + is preserved across the call. + + For example, consider three C functions:: + + u64 f1() { return (*_f2)(1); } + u64 f2(u64 a) { return f3(a + 1, a); } + u64 f3(u64 a, u64 b) { return a - b; } + + GCC can compile f1, f3 into x86_64:: + + f1: + movl $1, %edi + movq _f2(%rip), %rax + jmp *%rax + f3: + movq %rdi, %rax + subq %rsi, %rax + ret + + Function f2 in eBPF may look like:: + + f2: + bpf_mov R2, R1 + bpf_add R1, 1 + bpf_call f3 + bpf_exit + + If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and + returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to + be used to call into f2. + + For practical reasons all eBPF programs have only one argument 'ctx' which is + already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs + can call kernel functions with up to 5 arguments. Calls with 6 or more arguments + are currently not supported, but these restrictions can be lifted if necessary + in the future. + + On 64-bit architectures all register map to HW registers one to one. For + example, x86_64 JIT compiler can map them as ... + + :: + + R0 - rax + R1 - rdi + R2 - rsi + R3 - rdx + R4 - rcx + R5 - r8 + R6 - rbx + R7 - r13 + R8 - r14 + R9 - r15 + R10 - rbp + + ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing + and rbx, r12 - r15 are callee saved. + + Then the following internal BPF pseudo-program:: + + bpf_mov R6, R1 /* save ctx */ + bpf_mov R2, 2 + bpf_mov R3, 3 + bpf_mov R4, 4 + bpf_mov R5, 5 + bpf_call foo + bpf_mov R7, R0 /* save foo() return value */ + bpf_mov R1, R6 /* restore ctx for next call */ + bpf_mov R2, 6 + bpf_mov R3, 7 + bpf_mov R4, 8 + bpf_mov R5, 9 + bpf_call bar + bpf_add R0, R7 + bpf_exit + + After JIT to x86_64 may look like:: + + push %rbp + mov %rsp,%rbp + sub $0x228,%rsp + mov %rbx,-0x228(%rbp) + mov %r13,-0x220(%rbp) + mov %rdi,%rbx + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq foo + mov %rax,%r13 + mov %rbx,%rdi + mov $0x6,%esi + mov $0x7,%edx + mov $0x8,%ecx + mov $0x9,%r8d + callq bar + add %r13,%rax + mov -0x228(%rbp),%rbx + mov -0x220(%rbp),%r13 + leaveq + retq + + Which is in this example equivalent in C to:: + + u64 bpf_filter(u64 ctx) + { + return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); + } + + In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 + arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper + registers and place their return value into ``%rax`` which is R0 in eBPF. + Prologue and epilogue are emitted by JIT and are implicit in the + interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve + them across the calls as defined by calling convention. + + For example the following program is invalid:: + + bpf_mov R1, 1 + bpf_call foo + bpf_mov R0, R1 + bpf_exit + + After the call the registers R1-R5 contain junk values and cannot be read. + An in-kernel eBPF verifier is used to validate internal BPF programs. + +Also in the new design, eBPF is limited to 4096 insns, which means that any +program will terminate quickly and will only call a fixed number of kernel +functions. Original BPF and the new format are two operand instructions, +which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. + +The input context pointer for invoking the interpreter function is generic, +its content is defined by a specific use case. For seccomp register R1 points +to seccomp_data, for converted BPF filters R1 points to a skb. + +A program, that is translated internally consists of the following elements:: + + op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 + +So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field +has room for new instructions. Some of them may use 16/24/32 byte encoding. New +instructions must be multiple of 8 bytes to preserve backward compatibility. + +Internal BPF is a general purpose RISC instruction set. Not every register and +every instruction are used during translation from original BPF to new format. +For example, socket filters are not using ``exclusive add`` instruction, but +tracing filters may do to maintain counters of events, for example. Register R9 +is not used by socket filters either, but more complex filters may be running +out of registers and would have to resort to spill/fill to stack. + +Internal BPF can be used as a generic assembler for last step performance +optimizations, socket filters and seccomp are using it as assembler. Tracing +filters may use it as assembler to generate code from kernel. In kernel usage +may not be bounded by security considerations, since generated internal BPF code +may be optimizing internal code path and not being exposed to the user space. +Safety of internal BPF can come from a verifier (TBD). In such use cases as +described, it may be used as safe instruction set. + +Just like the original BPF, the new format runs within a controlled environment, +is deterministic and the kernel can easily prove that. The safety of the program +can be determined in two steps: first step does depth-first-search to disallow +loops and other CFG validation; second step starts from the first insn and +descends all possible paths. It simulates execution of every insn and observes +the state change of registers and stack. + +eBPF opcode encoding +-------------------- + +eBPF is reusing most of the opcode encoding from classic to simplify conversion +of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' +field is divided into three parts:: + + +----------------+--------+--------------------+ + | 4 bits | 1 bit | 3 bits | + | operation code | source | instruction class | + +----------------+--------+--------------------+ + (MSB) (LSB) + +Three LSB bits store instruction class which is one of: + + =================== =============== + Classic BPF classes eBPF classes + =================== =============== + BPF_LD 0x00 BPF_LD 0x00 + BPF_LDX 0x01 BPF_LDX 0x01 + BPF_ST 0x02 BPF_ST 0x02 + BPF_STX 0x03 BPF_STX 0x03 + BPF_ALU 0x04 BPF_ALU 0x04 + BPF_JMP 0x05 BPF_JMP 0x05 + BPF_RET 0x06 BPF_JMP32 0x06 + BPF_MISC 0x07 BPF_ALU64 0x07 + =================== =============== + +When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... + + :: + + BPF_K 0x00 + BPF_X 0x08 + + * in classic BPF, this means:: + + BPF_SRC(code) == BPF_X - use register X as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + + * in eBPF, this means:: + + BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + +... and four MSB bits store operation code. + +If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:: + + BPF_ADD 0x00 + BPF_SUB 0x10 + BPF_MUL 0x20 + BPF_DIV 0x30 + BPF_OR 0x40 + BPF_AND 0x50 + BPF_LSH 0x60 + BPF_RSH 0x70 + BPF_NEG 0x80 + BPF_MOD 0x90 + BPF_XOR 0xa0 + BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ + BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ + BPF_END 0xd0 /* eBPF only: endianness conversion */ + +If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:: + + BPF_JA 0x00 /* BPF_JMP only */ + BPF_JEQ 0x10 + BPF_JGT 0x20 + BPF_JGE 0x30 + BPF_JSET 0x40 + BPF_JNE 0x50 /* eBPF only: jump != */ + BPF_JSGT 0x60 /* eBPF only: signed '>' */ + BPF_JSGE 0x70 /* eBPF only: signed '>=' */ + BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */ + BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */ + BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ + BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ + BPF_JSLT 0xc0 /* eBPF only: signed '<' */ + BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ + +So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF +and eBPF. There are only two registers in classic BPF, so it means A += X. +In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, +BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous +src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. + +Classic BPF is using BPF_MISC class to represent A = X and X = A moves. +eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no +BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean +exactly the same operations as BPF_ALU, but with 64-bit wide operands +instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: +dst_reg = dst_reg + src_reg + +Classic BPF wastes the whole BPF_RET class to represent a single ``ret`` +operation. Classic BPF_RET | BPF_K means copy imm32 into return register +and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT +in eBPF means function exit only. The eBPF program needs to store return +value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as +BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide +operands for the comparisons instead. + +For load and store instructions the 8-bit 'code' field is divided as:: + + +--------+--------+-------------------+ + | 3 bits | 2 bits | 3 bits | + | mode | size | instruction class | + +--------+--------+-------------------+ + (MSB) (LSB) + +Size modifier is one of ... + +:: + + BPF_W 0x00 /* word */ + BPF_H 0x08 /* half word */ + BPF_B 0x10 /* byte */ + BPF_DW 0x18 /* eBPF only, double word */ + +... which encodes size of load/store operation:: + + B - 1 byte + H - 2 byte + W - 4 byte + DW - 8 byte (eBPF only) + +Mode modifier is one of:: + + BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ + BPF_ABS 0x20 + BPF_IND 0x40 + BPF_MEM 0x60 + BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ + BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ + BPF_XADD 0xc0 /* eBPF only, exclusive add */ + +eBPF has two non-generic instructions: (BPF_ABS | | BPF_LD) and +(BPF_IND | | BPF_LD) which are used to access packet data. + +They had to be carried over from classic to have strong performance of +socket filters running in eBPF interpreter. These instructions can only +be used when interpreter context is a pointer to ``struct sk_buff`` and +have seven implicit operands. Register R6 is an implicit input that must +contain pointer to sk_buff. Register R0 is an implicit output which contains +the data fetched from the packet. Registers R1-R5 are scratch registers +and must not be used to store the data across BPF_ABS | BPF_LD or +BPF_IND | BPF_LD instructions. + +These instructions have implicit program exit condition as well. When +eBPF program is trying to access the data beyond the packet boundary, +the interpreter will abort the execution of the program. JIT compilers +therefore must preserve this property. src_reg and imm32 fields are +explicit inputs to these instructions. + +For example:: + + BPF_IND | BPF_W | BPF_LD means: + + R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) + and R1 - R5 were scratched. + +Unlike classic BPF instruction set, eBPF has generic load/store operations:: + + BPF_MEM | | BPF_STX: *(size *) (dst_reg + off) = src_reg + BPF_MEM | | BPF_ST: *(size *) (dst_reg + off) = imm32 + BPF_MEM | | BPF_LDX: dst_reg = *(size *) (src_reg + off) + BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg + BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg + +Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and +2 byte atomic increments are not supported. + +eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists +of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single +instruction that loads 64-bit immediate value into a dst_reg. +Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads +32-bit immediate value into a register. + +eBPF verifier +------------- +The safety of the eBPF program is determined in two steps. + +First step does DAG check to disallow loops and other CFG validation. +In particular it will detect programs that have unreachable instructions. +(though classic BPF checker allows them) + +Second step starts from the first insn and descends all possible paths. +It simulates execution of every insn and observes the state change of +registers and stack. + +At the start of the program the register R1 contains a pointer to context +and has type PTR_TO_CTX. +If verifier sees an insn that does R2=R1, then R2 has now type +PTR_TO_CTX as well and can be used on the right hand side of expression. +If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, +since addition of two valid pointers makes invalid pointer. +(In 'secure' mode verifier will reject any type of pointer arithmetic to make +sure that kernel addresses don't leak to unprivileged users) + +If register was never written to, it's not readable:: + + bpf_mov R0 = R2 + bpf_exit + +will be rejected, since R2 is unreadable at the start of the program. + +After kernel function call, R1-R5 are reset to unreadable and +R0 has a return type of the function. + +Since R6-R9 are callee saved, their state is preserved across the call. + +:: + + bpf_mov R6 = 1 + bpf_call foo + bpf_mov R0 = R6 + bpf_exit + +is a correct program. If there was R1 instead of R6, it would have +been rejected. + +load/store instructions are allowed only with registers of valid types, which +are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. +For example:: + + bpf_mov R1 = 1 + bpf_mov R2 = 2 + bpf_xadd *(u32 *)(R1 + 3) += R2 + bpf_exit + +will be rejected, since R1 doesn't have a valid pointer type at the time of +execution of instruction bpf_xadd. + +At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``) +A callback is used to customize verifier to restrict eBPF program access to only +certain fields within ctx structure with specified size and alignment. + +For example, the following insn:: + + bpf_ld R0 = *(u32 *)(R6 + 8) + +intends to load a word from address R6 + 8 and store it into R0 +If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know +that offset 8 of size 4 bytes can be accessed for reading, otherwise +the verifier will reject the program. +If R6=PTR_TO_STACK, then access should be aligned and be within +stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, +so it will fail verification, since it's out of bounds. + +The verifier will allow eBPF program to read data from stack only after +it wrote into it. + +Classic BPF verifier does similar check with M[0-15] memory slots. +For example:: + + bpf_ld R0 = *(u32 *)(R10 - 4) + bpf_exit + +is invalid program. +Though R10 is correct read-only register and has type PTR_TO_STACK +and R10 - 4 is within stack bounds, there were no stores into that location. + +Pointer register spill/fill is tracked as well, since four (R6-R9) +callee saved registers may not be enough for some programs. + +Allowed function calls are customized with bpf_verifier_ops->get_func_proto() +The eBPF verifier will check that registers match argument constraints. +After the call register R0 will be set to return type of the function. + +Function calls is a main mechanism to extend functionality of eBPF programs. +Socket filters may let programs to call one set of functions, whereas tracing +filters may allow completely different set. + +If a function made accessible to eBPF program, it needs to be thought through +from safety point of view. The verifier will guarantee that the function is +called with valid arguments. + +seccomp vs socket filters have different security restrictions for classic BPF. +Seccomp solves this by two stage verifier: classic BPF verifier is followed +by seccomp verifier. In case of eBPF one configurable verifier is shared for +all use cases. + +See details of eBPF verifier in kernel/bpf/verifier.c + +Register value tracking +----------------------- +In order to determine the safety of an eBPF program, the verifier must track +the range of possible values in each register and also in each stack slot. +This is done with ``struct bpf_reg_state``, defined in include/linux/ +bpf_verifier.h, which unifies tracking of scalar and pointer values. Each +register state has a type, which is either NOT_INIT (the register has not been +written to), SCALAR_VALUE (some value which is not usable as a pointer), or a +pointer type. The types of pointers describe their base, as follows: + + + PTR_TO_CTX + Pointer to bpf_context. + CONST_PTR_TO_MAP + Pointer to struct bpf_map. "Const" because arithmetic + on these pointers is forbidden. + PTR_TO_MAP_VALUE + Pointer to the value stored in a map element. + PTR_TO_MAP_VALUE_OR_NULL + Either a pointer to a map value, or NULL; map accesses + (see section 'eBPF maps', below) return this type, + which becomes a PTR_TO_MAP_VALUE when checked != NULL. + Arithmetic on these pointers is forbidden. + PTR_TO_STACK + Frame pointer. + PTR_TO_PACKET + skb->data. + PTR_TO_PACKET_END + skb->data + headlen; arithmetic forbidden. + PTR_TO_SOCKET + Pointer to struct bpf_sock_ops, implicitly refcounted. + PTR_TO_SOCKET_OR_NULL + Either a pointer to a socket, or NULL; socket lookup + returns this type, which becomes a PTR_TO_SOCKET when + checked != NULL. PTR_TO_SOCKET is reference-counted, + so programs must release the reference through the + socket release function before the end of the program. + Arithmetic on these pointers is forbidden. + +However, a pointer may be offset from this base (as a result of pointer +arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable +offset'. The former is used when an exactly-known value (e.g. an immediate +operand) is added to a pointer, while the latter is used for values which are +not exactly known. The variable offset is also used in SCALAR_VALUEs, to track +the range of possible values in the register. + +The verifier's knowledge about the variable offset consists of: + +* minimum and maximum values as unsigned +* minimum and maximum values as signed + +* knowledge of the values of individual bits, in the form of a 'tnum': a u64 + 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; + 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both + mask and value; no bit should ever be 1 in both. For example, if a byte is read + into a register from memory, the register's top 56 bits are known zero, while + the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we + then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; + 0x1ff), because of potential carries. + +Besides arithmetic, the register state can also be updated by conditional +branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch +it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' +branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or +BPF_JSGE) would instead update the signed minimum/maximum values. Information +from the signed and unsigned bounds can be combined; for instance if a value is +first tested < 8 and then tested s> 4, the verifier will conclude that the value +is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. + +PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all +pointers sharing that same variable offset. This is important for packet range +checks: after adding a variable to a packet pointer register A, if you then copy +it to another register B and then add a constant 4 to A, both registers will +share the same 'id' but the A will have a fixed offset of +4. Then if A is +bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is +now known to have a safe range of at least 4 bytes. See 'Direct packet access', +below, for more on PTR_TO_PACKET ranges. + +The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of +the pointer returned from a map lookup. This means that when one copy is +checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. +As well as range-checking, the tracked information is also used for enforcing +alignment of pointer accesses. For instance, on most systems the packet pointer +is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump +over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting +pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 +bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through +that pointer are safe. +The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common +to all copies of the pointer returned from a socket lookup. This has similar +behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but +it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly +represents a reference to the corresponding ``struct sock``. To ensure that the +reference is not leaked, it is imperative to NULL-check the reference and in +the non-NULL case, and pass the valid reference to the socket release function. + +Direct packet access +-------------------- +In cls_bpf and act_bpf programs the verifier allows direct access to the packet +data via skb->data and skb->data_end pointers. +Ex:: + + 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ + 2: r3 = *(u32 *)(r1 +76) /* load skb->data */ + 3: r5 = r3 + 4: r5 += 14 + 5: if r5 > r4 goto pc+16 + R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp + 6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ + +this 2byte load from the packet is safe to do, since the program author +did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which +means that in the fall-through case the register R3 (which points to skb->data) +has at least 14 directly accessible bytes. The verifier marks it +as R3=pkt(id=0,off=0,r=14). +id=0 means that no additional variables were added to the register. +off=0 means that no additional constants were added. +r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. +Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points +to the packet data, but constant 14 was added to the register, so +it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14) +which is zero bytes. + +More complex packet access may look like:: + + + R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp + 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ + 7: r4 = *(u8 *)(r3 +12) + 8: r4 *= 14 + 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ + 10: r3 += r4 + 11: r2 = r1 + 12: r2 <<= 48 + 13: r2 >>= 48 + 14: r3 += r2 + 15: r2 = r3 + 16: r2 += 8 + 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ + 18: if r2 > r1 goto pc+2 + R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp + 19: r1 = *(u8 *)(r3 +4) + +The state of the register R3 is R3=pkt(id=2,off=0,r=8) +id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some +offset within a packet and since the program author did +``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8). +The verifier only allows 'add'/'sub' operations on packet registers. Any other +operation will set the register state to 'SCALAR_VALUE' and it won't be +available for direct packet access. + +Operation ``r3 += rX`` may overflow and become less than original skb->data, +therefore the verifier has to prevent that. So when it sees ``r3 += rX`` +instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 +against skb->data_end will not give us 'range' information, so attempts to read +through the pointer will give "invalid access to packet" error. + +Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is +R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits +of the register are guaranteed to be zero, and nothing is known about the lower +8 bits. After insn ``r4 *= 14`` the state becomes +R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit +value by constant 14 will keep upper 52 bits as zero, also the least significant +bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make +R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign +extending. This logic is implemented in adjust_reg_min_max_vals() function, +which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice +versa) and adjust_scalar_min_max_vals() for operations on two scalars. + +The end result is that bpf program author can access packet directly +using normal C code as:: + + void *data = (void *)(long)skb->data; + void *data_end = (void *)(long)skb->data_end; + struct eth_hdr *eth = data; + struct iphdr *iph = data + sizeof(*eth); + struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); + + if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) + return 0; + if (eth->h_proto != htons(ETH_P_IP)) + return 0; + if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) + return 0; + if (udp->dest == 53 || udp->source == 9) + ...; + +which makes such programs easier to write comparing to LD_ABS insn +and significantly faster. + +eBPF maps +--------- +'maps' is a generic storage of different types for sharing data between kernel +and userspace. + +The maps are accessed from user space via BPF syscall, which has commands: + +- create a map with given type and attributes + ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)`` + using attr->map_type, attr->key_size, attr->value_size, attr->max_entries + returns process-local file descriptor or negative error + +- lookup key in a given map + ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)`` + using attr->map_fd, attr->key, attr->value + returns zero and stores found elem into value or negative error + +- create or update key/value pair in a given map + ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)`` + using attr->map_fd, attr->key, attr->value + returns zero or negative error + +- find and delete element by key in a given map + ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)`` + using attr->map_fd, attr->key + +- to delete map: close(fd) + Exiting process will delete maps automatically + +userspace programs use this syscall to create/access maps that eBPF programs +are concurrently updating. + +maps can have different types: hash, array, bloom filter, radix-tree, etc. + +The map is defined by: + + - type + - max number of elements + - key size in bytes + - value size in bytes + +Pruning +------- +The verifier does not actually walk all possible paths through the program. For +each new branch to analyse, the verifier looks at all the states it's previously +been in when at this instruction. If any of them contain the current state as a +subset, the branch is 'pruned' - that is, the fact that the previous state was +accepted implies the current state would be as well. For instance, if in the +previous state, r1 held a packet-pointer, and in the current state, r1 holds a +packet-pointer with a range as long or longer and at least as strict an +alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't +have been used by any path from that point, so any value in r2 (including +another NOT_INIT) is safe. The implementation is in the function regsafe(). +Pruning considers not only the registers but also the stack (and any spilled +registers it may hold). They must all be safe for the branch to be pruned. +This is implemented in states_equal(). + +Understanding eBPF verifier messages +------------------------------------ + +The following are few examples of invalid eBPF programs and verifier error +messages as seen in the log: + +Program with unreachable instructions:: + + static struct bpf_insn prog[] = { + BPF_EXIT_INSN(), + BPF_EXIT_INSN(), + }; + +Error: + + unreachable insn 1 + +Program that reads uninitialized register:: + + BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), + BPF_EXIT_INSN(), + +Error:: + + 0: (bf) r0 = r2 + R2 !read_ok + +Program that doesn't initialize R0 before exiting:: + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), + BPF_EXIT_INSN(), + +Error:: + + 0: (bf) r2 = r1 + 1: (95) exit + R0 !read_ok + +Program that accesses stack out of bounds:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 +8) = 0 + invalid stack off=8 size=8 + +Program that doesn't initialize stack before passing its address into function:: + + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_EXIT_INSN(), + +Error:: + + 0: (bf) r2 = r10 + 1: (07) r2 += -8 + 2: (b7) r1 = 0x0 + 3: (85) call 1 + invalid indirect read from stack off -8+0 size 8 + +Program that uses invalid map_fd=0 while calling to map_lookup_elem() function:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 0x0 + 4: (85) call 1 + fd 0 is not pointing to valid bpf_map + +Program that doesn't check return value of map_lookup_elem() before accessing +map element:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 0x0 + 4: (85) call 1 + 5: (7a) *(u64 *)(r0 +0) = 0 + R0 invalid mem access 'map_value_or_null' + +Program that correctly checks map_lookup_elem() returned value for NULL, but +accesses the memory with incorrect alignment:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 1 + 4: (85) call 1 + 5: (15) if r0 == 0x0 goto pc+1 + R0=map_ptr R10=fp + 6: (7a) *(u64 *)(r0 +4) = 0 + misaligned access off 4 size 8 + +Program that correctly checks map_lookup_elem() returned value for NULL and +accesses memory with correct alignment in one side of 'if' branch, but fails +to do so in the other side of 'if' branch:: + + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), + BPF_EXIT_INSN(), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + +Error:: + + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 1 + 4: (85) call 1 + 5: (15) if r0 == 0x0 goto pc+2 + R0=map_ptr R10=fp + 6: (7a) *(u64 *)(r0 +0) = 0 + 7: (95) exit + + from 5 to 8: R0=imm0 R10=fp + 8: (7a) *(u64 *)(r0 +0) = 1 + R0 invalid mem access 'imm' + +Program that performs a socket lookup then sets the pointer to NULL without +checking it:: + + BPF_MOV64_IMM(BPF_REG_2, 0), + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_MOV64_IMM(BPF_REG_3, 4), + BPF_MOV64_IMM(BPF_REG_4, 0), + BPF_MOV64_IMM(BPF_REG_5, 0), + BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + +Error:: + + 0: (b7) r2 = 0 + 1: (63) *(u32 *)(r10 -8) = r2 + 2: (bf) r2 = r10 + 3: (07) r2 += -8 + 4: (b7) r3 = 4 + 5: (b7) r4 = 0 + 6: (b7) r5 = 0 + 7: (85) call bpf_sk_lookup_tcp#65 + 8: (b7) r0 = 0 + 9: (95) exit + Unreleased reference id=1, alloc_insn=7 + +Program that performs a socket lookup but does not NULL-check the returned +value:: + + BPF_MOV64_IMM(BPF_REG_2, 0), + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_MOV64_IMM(BPF_REG_3, 4), + BPF_MOV64_IMM(BPF_REG_4, 0), + BPF_MOV64_IMM(BPF_REG_5, 0), + BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), + BPF_EXIT_INSN(), + +Error:: + + 0: (b7) r2 = 0 + 1: (63) *(u32 *)(r10 -8) = r2 + 2: (bf) r2 = r10 + 3: (07) r2 += -8 + 4: (b7) r3 = 4 + 5: (b7) r4 = 0 + 6: (b7) r5 = 0 + 7: (85) call bpf_sk_lookup_tcp#65 + 8: (95) exit + Unreleased reference id=1, alloc_insn=7 + +Testing +------- + +Next to the BPF toolchain, the kernel also ships a test module that contains +various test cases for classic and internal BPF that can be executed against +the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and +enabled via Kconfig:: + + CONFIG_TEST_BPF=m + +After the module has been built and installed, the test suite can be executed +via insmod or modprobe against 'test_bpf' module. Results of the test cases +including timings in nsec can be found in the kernel log (dmesg). + +Misc +---- + +Also trinity, the Linux syscall fuzzer, has built-in support for BPF and +SECCOMP-BPF kernel fuzzing. + +Written by +---------- + +The document was written in the hope that it is found useful and in order +to give potential BPF hackers or security auditors a better overview of +the underlying architecture. + +- Jay Schulist +- Daniel Borkmann +- Alexei Starovoitov diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt deleted file mode 100644 index 2f0f8b17dade..000000000000 --- a/Documentation/networking/filter.txt +++ /dev/null @@ -1,1545 +0,0 @@ -Linux Socket Filtering aka Berkeley Packet Filter (BPF) -======================================================= - -Introduction ------------- - -Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. -Though there are some distinct differences between the BSD and Linux -Kernel filtering, but when we speak of BPF or LSF in Linux context, we -mean the very same mechanism of filtering in the Linux kernel. - -BPF allows a user-space program to attach a filter onto any socket and -allow or disallow certain types of data to come through the socket. LSF -follows exactly the same filter code structure as BSD's BPF, so referring -to the BSD bpf.4 manpage is very helpful in creating filters. - -On Linux, BPF is much simpler than on BSD. One does not have to worry -about devices or anything like that. You simply create your filter code, -send it to the kernel via the SO_ATTACH_FILTER option and if your filter -code passes the kernel check on it, you then immediately begin filtering -data on that socket. - -You can also detach filters from your socket via the SO_DETACH_FILTER -option. This will probably not be used much since when you close a socket -that has a filter on it the filter is automagically removed. The other -less common case may be adding a different filter on the same socket where -you had another filter that is still running: the kernel takes care of -removing the old one and placing your new one in its place, assuming your -filter has passed the checks, otherwise if it fails the old filter will -remain on that socket. - -SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once -set, a filter cannot be removed or changed. This allows one process to -setup a socket, attach a filter, lock it then drop privileges and be -assured that the filter will be kept until the socket is closed. - -The biggest user of this construct might be libpcap. Issuing a high-level -filter command like `tcpdump -i em1 port 22` passes through the libpcap -internal compiler that generates a structure that can eventually be loaded -via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` -displays what is being placed into this structure. - -Although we were only speaking about sockets here, BPF in Linux is used -in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel -qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places -such as team driver, PTP code, etc where BPF is being used. - - [1] Documentation/userspace-api/seccomp_filter.rst - -Original BPF paper: - -Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new -architecture for user-level packet capture. In Proceedings of the -USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 -Conference Proceedings (USENIX'93). USENIX Association, Berkeley, -CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf] - -Structure ---------- - -User space applications include which contains the -following relevant structures: - -struct sock_filter { /* Filter block */ - __u16 code; /* Actual filter code */ - __u8 jt; /* Jump true */ - __u8 jf; /* Jump false */ - __u32 k; /* Generic multiuse field */ -}; - -Such a structure is assembled as an array of 4-tuples, that contains -a code, jt, jf and k value. jt and jf are jump offsets and k a generic -value to be used for a provided code. - -struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ - unsigned short len; /* Number of filter blocks */ - struct sock_filter __user *filter; -}; - -For socket filtering, a pointer to this structure (as shown in -follow-up example) is being passed to the kernel through setsockopt(2). - -Example -------- - -#include -#include -#include -#include -/* ... */ - -/* From the example above: tcpdump -i em1 port 22 -dd */ -struct sock_filter code[] = { - { 0x28, 0, 0, 0x0000000c }, - { 0x15, 0, 8, 0x000086dd }, - { 0x30, 0, 0, 0x00000014 }, - { 0x15, 2, 0, 0x00000084 }, - { 0x15, 1, 0, 0x00000006 }, - { 0x15, 0, 17, 0x00000011 }, - { 0x28, 0, 0, 0x00000036 }, - { 0x15, 14, 0, 0x00000016 }, - { 0x28, 0, 0, 0x00000038 }, - { 0x15, 12, 13, 0x00000016 }, - { 0x15, 0, 12, 0x00000800 }, - { 0x30, 0, 0, 0x00000017 }, - { 0x15, 2, 0, 0x00000084 }, - { 0x15, 1, 0, 0x00000006 }, - { 0x15, 0, 8, 0x00000011 }, - { 0x28, 0, 0, 0x00000014 }, - { 0x45, 6, 0, 0x00001fff }, - { 0xb1, 0, 0, 0x0000000e }, - { 0x48, 0, 0, 0x0000000e }, - { 0x15, 2, 0, 0x00000016 }, - { 0x48, 0, 0, 0x00000010 }, - { 0x15, 0, 1, 0x00000016 }, - { 0x06, 0, 0, 0x0000ffff }, - { 0x06, 0, 0, 0x00000000 }, -}; - -struct sock_fprog bpf = { - .len = ARRAY_SIZE(code), - .filter = code, -}; - -sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); -if (sock < 0) - /* ... bail out ... */ - -ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); -if (ret < 0) - /* ... bail out ... */ - -/* ... */ -close(sock); - -The above example code attaches a socket filter for a PF_PACKET socket -in order to let all IPv4/IPv6 packets with port 22 pass. The rest will -be dropped for this socket. - -The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments -and SO_LOCK_FILTER for preventing the filter to be detached, takes an -integer value with 0 or 1. - -Note that socket filters are not restricted to PF_PACKET sockets only, -but can also be used on other socket families. - -Summary of system calls: - - * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); - * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); - * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); - -Normally, most use cases for socket filtering on packet sockets will be -covered by libpcap in high-level syntax, so as an application developer -you should stick to that. libpcap wraps its own layer around all that. - -Unless i) using/linking to libpcap is not an option, ii) the required BPF -filters use Linux extensions that are not supported by libpcap's compiler, -iii) a filter might be more complex and not cleanly implementable with -libpcap's compiler, or iv) particular filter codes should be optimized -differently than libpcap's internal compiler does; then in such cases -writing such a filter "by hand" can be of an alternative. For example, -xt_bpf and cls_bpf users might have requirements that could result in -more complex filter code, or one that cannot be expressed with libpcap -(e.g. different return codes for various code paths). Moreover, BPF JIT -implementors may wish to manually write test cases and thus need low-level -access to BPF code as well. - -BPF engine and instruction set ------------------------------- - -Under tools/bpf/ there's a small helper tool called bpf_asm which can -be used to write low-level filters for example scenarios mentioned in the -previous section. Asm-like syntax mentioned here has been implemented in -bpf_asm and will be used for further explanations (instead of dealing with -less readable opcodes directly, principles are the same). The syntax is -closely modelled after Steven McCanne's and Van Jacobson's BPF paper. - -The BPF architecture consists of the following basic elements: - - Element Description - - A 32 bit wide accumulator - X 32 bit wide X register - M[] 16 x 32 bit wide misc registers aka "scratch memory - store", addressable from 0 to 15 - -A program, that is translated by bpf_asm into "opcodes" is an array that -consists of the following elements (as already mentioned): - - op:16, jt:8, jf:8, k:32 - -The element op is a 16 bit wide opcode that has a particular instruction -encoded. jt and jf are two 8 bit wide jump targets, one for condition -"jump if true", the other one "jump if false". Eventually, element k -contains a miscellaneous argument that can be interpreted in different -ways depending on the given instruction in op. - -The instruction set consists of load, store, branch, alu, miscellaneous -and return instructions that are also represented in bpf_asm syntax. This -table lists all bpf_asm instructions available resp. what their underlying -opcodes as defined in linux/filter.h stand for: - - Instruction Addressing mode Description - - ld 1, 2, 3, 4, 12 Load word into A - ldi 4 Load word into A - ldh 1, 2 Load half-word into A - ldb 1, 2 Load byte into A - ldx 3, 4, 5, 12 Load word into X - ldxi 4 Load word into X - ldxb 5 Load byte into X - - st 3 Store A into M[] - stx 3 Store X into M[] - - jmp 6 Jump to label - ja 6 Jump to label - jeq 7, 8, 9, 10 Jump on A == - jneq 9, 10 Jump on A != - jne 9, 10 Jump on A != - jlt 9, 10 Jump on A < - jle 9, 10 Jump on A <= - jgt 7, 8, 9, 10 Jump on A > - jge 7, 8, 9, 10 Jump on A >= - jset 7, 8, 9, 10 Jump on A & - - add 0, 4 A + - sub 0, 4 A - - mul 0, 4 A * - div 0, 4 A / - mod 0, 4 A % - neg !A - and 0, 4 A & - or 0, 4 A | - xor 0, 4 A ^ - lsh 0, 4 A << - rsh 0, 4 A >> - - tax Copy A into X - txa Copy X into A - - ret 4, 11 Return - -The next table shows addressing formats from the 2nd column: - - Addressing mode Syntax Description - - 0 x/%x Register X - 1 [k] BHW at byte offset k in the packet - 2 [x + k] BHW at the offset X + k in the packet - 3 M[k] Word at offset k in M[] - 4 #k Literal value stored in k - 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet - 6 L Jump label L - 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf - 8 x/%x,Lt,Lf Jump to Lt if true, otherwise jump to Lf - 9 #k,Lt Jump to Lt if predicate is true - 10 x/%x,Lt Jump to Lt if predicate is true - 11 a/%a Accumulator A - 12 extension BPF extension - -The Linux kernel also has a couple of BPF extensions that are used along -with the class of load instructions by "overloading" the k argument with -a negative offset + a particular extension offset. The result of such BPF -extensions are loaded into A. - -Possible BPF extensions are shown in the following table: - - Extension Description - - len skb->len - proto skb->protocol - type skb->pkt_type - poff Payload start offset - ifidx skb->dev->ifindex - nla Netlink attribute of type X with offset A - nlan Nested Netlink attribute of type X with offset A - mark skb->mark - queue skb->queue_mapping - hatype skb->dev->type - rxhash skb->hash - cpu raw_smp_processor_id() - vlan_tci skb_vlan_tag_get(skb) - vlan_avail skb_vlan_tag_present(skb) - vlan_tpid skb->vlan_proto - rand prandom_u32() - -These extensions can also be prefixed with '#'. -Examples for low-level BPF: - -** ARP packets: - - ldh [12] - jne #0x806, drop - ret #-1 - drop: ret #0 - -** IPv4 TCP packets: - - ldh [12] - jne #0x800, drop - ldb [23] - jneq #6, drop - ret #-1 - drop: ret #0 - -** (Accelerated) VLAN w/ id 10: - - ld vlan_tci - jneq #10, drop - ret #-1 - drop: ret #0 - -** icmp random packet sampling, 1 in 4 - ldh [12] - jne #0x800, drop - ldb [23] - jneq #1, drop - # get a random uint32 number - ld rand - mod #4 - jneq #1, drop - ret #-1 - drop: ret #0 - -** SECCOMP filter example: - - ld [4] /* offsetof(struct seccomp_data, arch) */ - jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ - ld [0] /* offsetof(struct seccomp_data, nr) */ - jeq #15, good /* __NR_rt_sigreturn */ - jeq #231, good /* __NR_exit_group */ - jeq #60, good /* __NR_exit */ - jeq #0, good /* __NR_read */ - jeq #1, good /* __NR_write */ - jeq #5, good /* __NR_fstat */ - jeq #9, good /* __NR_mmap */ - jeq #14, good /* __NR_rt_sigprocmask */ - jeq #13, good /* __NR_rt_sigaction */ - jeq #35, good /* __NR_nanosleep */ - bad: ret #0 /* SECCOMP_RET_KILL_THREAD */ - good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ - -The above example code can be placed into a file (here called "foo"), and -then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf -and cls_bpf understands and can directly be loaded with. Example with above -ARP code: - -$ ./bpf_asm foo -4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, - -In copy and paste C-like output: - -$ ./bpf_asm -c foo -{ 0x28, 0, 0, 0x0000000c }, -{ 0x15, 0, 1, 0x00000806 }, -{ 0x06, 0, 0, 0xffffffff }, -{ 0x06, 0, 0, 0000000000 }, - -In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF -filters that might not be obvious at first, it's good to test filters before -attaching to a live system. For that purpose, there's a small tool called -bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows -for testing BPF filters against given pcap files, single stepping through the -BPF code on the pcap's packets and to do BPF machine register dumps. - -Starting bpf_dbg is trivial and just requires issuing: - -# ./bpf_dbg - -In case input and output do not equal stdin/stdout, bpf_dbg takes an -alternative stdin source as a first argument, and an alternative stdout -sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. - -Other than that, a particular libreadline configuration can be set via -file "~/.bpf_dbg_init" and the command history is stored in the file -"~/.bpf_dbg_history". - -Interaction in bpf_dbg happens through a shell that also has auto-completion -support (follow-up example commands starting with '>' denote bpf_dbg shell). -The usual workflow would be to ... - -> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 - Loads a BPF filter from standard output of bpf_asm, or transformed via - e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT - debugging (next section), this command creates a temporary socket and - loads the BPF code into the kernel. Thus, this will also be useful for - JIT developers. - -> load pcap foo.pcap - Loads standard tcpdump pcap file. - -> run [] -bpf passes:1 fails:9 - Runs through all packets from a pcap to account how many passes and fails - the filter will generate. A limit of packets to traverse can be given. - -> disassemble -l0: ldh [12] -l1: jeq #0x800, l2, l5 -l2: ldb [23] -l3: jeq #0x1, l4, l5 -l4: ret #0xffff -l5: ret #0 - Prints out BPF code disassembly. - -> dump -/* { op, jt, jf, k }, */ -{ 0x28, 0, 0, 0x0000000c }, -{ 0x15, 0, 3, 0x00000800 }, -{ 0x30, 0, 0, 0x00000017 }, -{ 0x15, 0, 1, 0x00000001 }, -{ 0x06, 0, 0, 0x0000ffff }, -{ 0x06, 0, 0, 0000000000 }, - Prints out C-style BPF code dump. - -> breakpoint 0 -breakpoint at: l0: ldh [12] -> breakpoint 1 -breakpoint at: l1: jeq #0x800, l2, l5 - ... - Sets breakpoints at particular BPF instructions. Issuing a `run` command - will walk through the pcap file continuing from the current packet and - break when a breakpoint is being hit (another `run` will continue from - the currently active breakpoint executing next instructions): - - > run - -- register dump -- - pc: [0] <-- program counter - code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction - curr: l0: ldh [12] <-- disassembly of current instruction - A: [00000000][0] <-- content of A (hex, decimal) - X: [00000000][0] <-- content of X (hex, decimal) - M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) - -- packet dump -- <-- Current packet from pcap (hex) - len: 42 - 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 - 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 - 32: 00 00 00 00 00 00 0a 3b 01 01 - (breakpoint) - > - -> breakpoint -breakpoints: 0 1 - Prints currently set breakpoints. - -> step [-, +] - Performs single stepping through the BPF program from the current pc - offset. Thus, on each step invocation, above register dump is issued. - This can go forwards and backwards in time, a plain `step` will break - on the next BPF instruction, thus +1. (No `run` needs to be issued here.) - -> select - Selects a given packet from the pcap file to continue from. Thus, on - the next `run` or `step`, the BPF program is being evaluated against - the user pre-selected packet. Numbering starts just as in Wireshark - with index 1. - -> quit -# - Exits bpf_dbg. - -JIT compiler ------------- - -The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, -PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through -CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each -attached filter from user space or for internal kernel users if it has -been previously enabled by root: - - echo 1 > /proc/sys/net/core/bpf_jit_enable - -For JIT developers, doing audits etc, each compile run can output the generated -opcode image into the kernel log via: - - echo 2 > /proc/sys/net/core/bpf_jit_enable - -Example output from dmesg: - -[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f -[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 -[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 -[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 -[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 -[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 - -When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and -setting any other value than that will return in failure. This is even the case for -setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log -is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the -generally recommended approach instead. - -In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for -generating disassembly out of the kernel log's hexdump: - -# ./bpf_jit_disasm -70 bytes emitted from JIT compiler (pass:3, flen:6) -ffffffffa0069c8f + : - 0: push %rbp - 1: mov %rsp,%rbp - 4: sub $0x60,%rsp - 8: mov %rbx,-0x8(%rbp) - c: mov 0x68(%rdi),%r9d - 10: sub 0x6c(%rdi),%r9d - 14: mov 0xd8(%rdi),%r8 - 1b: mov $0xc,%esi - 20: callq 0xffffffffe0ff9442 - 25: cmp $0x800,%eax - 2a: jne 0x0000000000000042 - 2c: mov $0x17,%esi - 31: callq 0xffffffffe0ff945e - 36: cmp $0x1,%eax - 39: jne 0x0000000000000042 - 3b: mov $0xffff,%eax - 40: jmp 0x0000000000000044 - 42: xor %eax,%eax - 44: leaveq - 45: retq - -Issuing option `-o` will "annotate" opcodes to resulting assembler -instructions, which can be very useful for JIT developers: - -# ./bpf_jit_disasm -o -70 bytes emitted from JIT compiler (pass:3, flen:6) -ffffffffa0069c8f + : - 0: push %rbp - 55 - 1: mov %rsp,%rbp - 48 89 e5 - 4: sub $0x60,%rsp - 48 83 ec 60 - 8: mov %rbx,-0x8(%rbp) - 48 89 5d f8 - c: mov 0x68(%rdi),%r9d - 44 8b 4f 68 - 10: sub 0x6c(%rdi),%r9d - 44 2b 4f 6c - 14: mov 0xd8(%rdi),%r8 - 4c 8b 87 d8 00 00 00 - 1b: mov $0xc,%esi - be 0c 00 00 00 - 20: callq 0xffffffffe0ff9442 - e8 1d 94 ff e0 - 25: cmp $0x800,%eax - 3d 00 08 00 00 - 2a: jne 0x0000000000000042 - 75 16 - 2c: mov $0x17,%esi - be 17 00 00 00 - 31: callq 0xffffffffe0ff945e - e8 28 94 ff e0 - 36: cmp $0x1,%eax - 83 f8 01 - 39: jne 0x0000000000000042 - 75 07 - 3b: mov $0xffff,%eax - b8 ff ff 00 00 - 40: jmp 0x0000000000000044 - eb 02 - 42: xor %eax,%eax - 31 c0 - 44: leaveq - c9 - 45: retq - c3 - -For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful -toolchain for developing and testing the kernel's JIT compiler. - -BPF kernel internals --------------------- -Internally, for the kernel interpreter, a different instruction set -format with similar underlying principles from BPF described in previous -paragraphs is being used. However, the instruction set format is modelled -closer to the underlying architecture to mimic native instruction sets, so -that a better performance can be achieved (more details later). This new -ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which -originates from [e]xtended BPF is not the same as BPF extensions! While -eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' -of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) - -It is designed to be JITed with one to one mapping, which can also open up -the possibility for GCC/LLVM compilers to generate optimized eBPF code through -an eBPF backend that performs almost as fast as natively compiled code. - -The new instruction set was originally designed with the possible goal in -mind to write programs in "restricted C" and compile into eBPF with a optional -GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with -minimal performance overhead over two steps, that is, C -> eBPF -> native code. - -Currently, the new format is being used for running user BPF programs, which -includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, -team driver's classifier for its load-balancing mode, netfilter's xt_bpf -extension, PTP dissector/classifier, and much more. They are all internally -converted by the kernel into the new instruction set representation and run -in the eBPF interpreter. For in-kernel handlers, this all works transparently -by using bpf_prog_create() for setting up the filter, resp. -bpf_prog_destroy() for destroying it. The macro -BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed -code to run the filter. 'filter' is a pointer to struct bpf_prog that we -got from bpf_prog_create(), and 'ctx' the given context (e.g. -skb pointer). All constraints and restrictions from bpf_check_classic() apply -before a conversion to the new layout is being done behind the scenes! - -Currently, the classic BPF format is being used for JITing on most -32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64, -sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF -instruction set. - -Some core changes of the new internal format: - -- Number of registers increase from 2 to 10: - - The old format had two registers A and X, and a hidden frame pointer. The - new layout extends this to be 10 internal registers and a read-only frame - pointer. Since 64-bit CPUs are passing arguments to functions via registers - the number of args from eBPF program to in-kernel function is restricted - to 5 and one register is used to accept return value from an in-kernel - function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ - sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved - registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. - - Therefore, eBPF calling convention is defined as: - - * R0 - return value from in-kernel function, and exit value for eBPF program - * R1 - R5 - arguments from eBPF program to in-kernel function - * R6 - R9 - callee saved registers that in-kernel function will preserve - * R10 - read-only frame pointer to access stack - - Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, - etc, and eBPF calling convention maps directly to ABIs used by the kernel on - 64-bit architectures. - - On 32-bit architectures JIT may map programs that use only 32-bit arithmetic - and may let more complex programs to be interpreted. - - R0 - R5 are scratch registers and eBPF program needs spill/fill them if - necessary across calls. Note that there is only one eBPF program (== one - eBPF main routine) and it cannot call other eBPF functions, it can only - call predefined in-kernel functions, though. - -- Register width increases from 32-bit to 64-bit: - - Still, the semantics of the original 32-bit ALU operations are preserved - via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower - subregisters that zero-extend into 64-bit if they are being written to. - That behavior maps directly to x86_64 and arm64 subregister definition, but - makes other JITs more difficult. - - 32-bit architectures run 64-bit internal BPF programs via interpreter. - Their JITs may convert BPF programs that only use 32-bit subregisters into - native instruction set and let the rest being interpreted. - - Operation is 64-bit, because on 64-bit architectures, pointers are also - 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, - so 32-bit eBPF registers would otherwise require to define register-pair - ABI, thus, there won't be able to use a direct eBPF register to HW register - mapping and JIT would need to do combine/split/move operations for every - register in and out of the function, which is complex, bug prone and slow. - Another reason is the use of atomic 64-bit counters. - -- Conditional jt/jf targets replaced with jt/fall-through: - - While the original design has constructs such as "if (cond) jump_true; - else jump_false;", they are being replaced into alternative constructs like - "if (cond) jump_true; /* else fall-through */". - -- Introduces bpf_call insn and register passing convention for zero overhead - calls from/to other kernel functions: - - Before an in-kernel function call, the internal BPF program needs to - place function arguments into R1 to R5 registers to satisfy calling - convention, then the interpreter will take them from registers and pass - to in-kernel function. If R1 - R5 registers are mapped to CPU registers - that are used for argument passing on given architecture, the JIT compiler - doesn't need to emit extra moves. Function arguments will be in the correct - registers and BPF_CALL instruction will be JITed as single 'call' HW - instruction. This calling convention was picked to cover common call - situations without performance penalty. - - After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has - a return value of the function. Since R6 - R9 are callee saved, their state - is preserved across the call. - - For example, consider three C functions: - - u64 f1() { return (*_f2)(1); } - u64 f2(u64 a) { return f3(a + 1, a); } - u64 f3(u64 a, u64 b) { return a - b; } - - GCC can compile f1, f3 into x86_64: - - f1: - movl $1, %edi - movq _f2(%rip), %rax - jmp *%rax - f3: - movq %rdi, %rax - subq %rsi, %rax - ret - - Function f2 in eBPF may look like: - - f2: - bpf_mov R2, R1 - bpf_add R1, 1 - bpf_call f3 - bpf_exit - - If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and - returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to - be used to call into f2. - - For practical reasons all eBPF programs have only one argument 'ctx' which is - already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs - can call kernel functions with up to 5 arguments. Calls with 6 or more arguments - are currently not supported, but these restrictions can be lifted if necessary - in the future. - - On 64-bit architectures all register map to HW registers one to one. For - example, x86_64 JIT compiler can map them as ... - - R0 - rax - R1 - rdi - R2 - rsi - R3 - rdx - R4 - rcx - R5 - r8 - R6 - rbx - R7 - r13 - R8 - r14 - R9 - r15 - R10 - rbp - - ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing - and rbx, r12 - r15 are callee saved. - - Then the following internal BPF pseudo-program: - - bpf_mov R6, R1 /* save ctx */ - bpf_mov R2, 2 - bpf_mov R3, 3 - bpf_mov R4, 4 - bpf_mov R5, 5 - bpf_call foo - bpf_mov R7, R0 /* save foo() return value */ - bpf_mov R1, R6 /* restore ctx for next call */ - bpf_mov R2, 6 - bpf_mov R3, 7 - bpf_mov R4, 8 - bpf_mov R5, 9 - bpf_call bar - bpf_add R0, R7 - bpf_exit - - After JIT to x86_64 may look like: - - push %rbp - mov %rsp,%rbp - sub $0x228,%rsp - mov %rbx,-0x228(%rbp) - mov %r13,-0x220(%rbp) - mov %rdi,%rbx - mov $0x2,%esi - mov $0x3,%edx - mov $0x4,%ecx - mov $0x5,%r8d - callq foo - mov %rax,%r13 - mov %rbx,%rdi - mov $0x6,%esi - mov $0x7,%edx - mov $0x8,%ecx - mov $0x9,%r8d - callq bar - add %r13,%rax - mov -0x228(%rbp),%rbx - mov -0x220(%rbp),%r13 - leaveq - retq - - Which is in this example equivalent in C to: - - u64 bpf_filter(u64 ctx) - { - return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); - } - - In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 - arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper - registers and place their return value into '%rax' which is R0 in eBPF. - Prologue and epilogue are emitted by JIT and are implicit in the - interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve - them across the calls as defined by calling convention. - - For example the following program is invalid: - - bpf_mov R1, 1 - bpf_call foo - bpf_mov R0, R1 - bpf_exit - - After the call the registers R1-R5 contain junk values and cannot be read. - An in-kernel eBPF verifier is used to validate internal BPF programs. - -Also in the new design, eBPF is limited to 4096 insns, which means that any -program will terminate quickly and will only call a fixed number of kernel -functions. Original BPF and the new format are two operand instructions, -which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. - -The input context pointer for invoking the interpreter function is generic, -its content is defined by a specific use case. For seccomp register R1 points -to seccomp_data, for converted BPF filters R1 points to a skb. - -A program, that is translated internally consists of the following elements: - - op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 - -So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field -has room for new instructions. Some of them may use 16/24/32 byte encoding. New -instructions must be multiple of 8 bytes to preserve backward compatibility. - -Internal BPF is a general purpose RISC instruction set. Not every register and -every instruction are used during translation from original BPF to new format. -For example, socket filters are not using 'exclusive add' instruction, but -tracing filters may do to maintain counters of events, for example. Register R9 -is not used by socket filters either, but more complex filters may be running -out of registers and would have to resort to spill/fill to stack. - -Internal BPF can be used as a generic assembler for last step performance -optimizations, socket filters and seccomp are using it as assembler. Tracing -filters may use it as assembler to generate code from kernel. In kernel usage -may not be bounded by security considerations, since generated internal BPF code -may be optimizing internal code path and not being exposed to the user space. -Safety of internal BPF can come from a verifier (TBD). In such use cases as -described, it may be used as safe instruction set. - -Just like the original BPF, the new format runs within a controlled environment, -is deterministic and the kernel can easily prove that. The safety of the program -can be determined in two steps: first step does depth-first-search to disallow -loops and other CFG validation; second step starts from the first insn and -descends all possible paths. It simulates execution of every insn and observes -the state change of registers and stack. - -eBPF opcode encoding --------------------- - -eBPF is reusing most of the opcode encoding from classic to simplify conversion -of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' -field is divided into three parts: - - +----------------+--------+--------------------+ - | 4 bits | 1 bit | 3 bits | - | operation code | source | instruction class | - +----------------+--------+--------------------+ - (MSB) (LSB) - -Three LSB bits store instruction class which is one of: - - Classic BPF classes: eBPF classes: - - BPF_LD 0x00 BPF_LD 0x00 - BPF_LDX 0x01 BPF_LDX 0x01 - BPF_ST 0x02 BPF_ST 0x02 - BPF_STX 0x03 BPF_STX 0x03 - BPF_ALU 0x04 BPF_ALU 0x04 - BPF_JMP 0x05 BPF_JMP 0x05 - BPF_RET 0x06 BPF_JMP32 0x06 - BPF_MISC 0x07 BPF_ALU64 0x07 - -When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... - - BPF_K 0x00 - BPF_X 0x08 - - * in classic BPF, this means: - - BPF_SRC(code) == BPF_X - use register X as source operand - BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand - - * in eBPF, this means: - - BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand - BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand - -... and four MSB bits store operation code. - -If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: - - BPF_ADD 0x00 - BPF_SUB 0x10 - BPF_MUL 0x20 - BPF_DIV 0x30 - BPF_OR 0x40 - BPF_AND 0x50 - BPF_LSH 0x60 - BPF_RSH 0x70 - BPF_NEG 0x80 - BPF_MOD 0x90 - BPF_XOR 0xa0 - BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ - BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ - BPF_END 0xd0 /* eBPF only: endianness conversion */ - -If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of: - - BPF_JA 0x00 /* BPF_JMP only */ - BPF_JEQ 0x10 - BPF_JGT 0x20 - BPF_JGE 0x30 - BPF_JSET 0x40 - BPF_JNE 0x50 /* eBPF only: jump != */ - BPF_JSGT 0x60 /* eBPF only: signed '>' */ - BPF_JSGE 0x70 /* eBPF only: signed '>=' */ - BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */ - BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */ - BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ - BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ - BPF_JSLT 0xc0 /* eBPF only: signed '<' */ - BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ - -So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF -and eBPF. There are only two registers in classic BPF, so it means A += X. -In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, -BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous -src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. - -Classic BPF is using BPF_MISC class to represent A = X and X = A moves. -eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no -BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean -exactly the same operations as BPF_ALU, but with 64-bit wide operands -instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: -dst_reg = dst_reg + src_reg - -Classic BPF wastes the whole BPF_RET class to represent a single 'ret' -operation. Classic BPF_RET | BPF_K means copy imm32 into return register -and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT -in eBPF means function exit only. The eBPF program needs to store return -value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as -BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide -operands for the comparisons instead. - -For load and store instructions the 8-bit 'code' field is divided as: - - +--------+--------+-------------------+ - | 3 bits | 2 bits | 3 bits | - | mode | size | instruction class | - +--------+--------+-------------------+ - (MSB) (LSB) - -Size modifier is one of ... - - BPF_W 0x00 /* word */ - BPF_H 0x08 /* half word */ - BPF_B 0x10 /* byte */ - BPF_DW 0x18 /* eBPF only, double word */ - -... which encodes size of load/store operation: - - B - 1 byte - H - 2 byte - W - 4 byte - DW - 8 byte (eBPF only) - -Mode modifier is one of: - - BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ - BPF_ABS 0x20 - BPF_IND 0x40 - BPF_MEM 0x60 - BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ - BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ - BPF_XADD 0xc0 /* eBPF only, exclusive add */ - -eBPF has two non-generic instructions: (BPF_ABS | | BPF_LD) and -(BPF_IND | | BPF_LD) which are used to access packet data. - -They had to be carried over from classic to have strong performance of -socket filters running in eBPF interpreter. These instructions can only -be used when interpreter context is a pointer to 'struct sk_buff' and -have seven implicit operands. Register R6 is an implicit input that must -contain pointer to sk_buff. Register R0 is an implicit output which contains -the data fetched from the packet. Registers R1-R5 are scratch registers -and must not be used to store the data across BPF_ABS | BPF_LD or -BPF_IND | BPF_LD instructions. - -These instructions have implicit program exit condition as well. When -eBPF program is trying to access the data beyond the packet boundary, -the interpreter will abort the execution of the program. JIT compilers -therefore must preserve this property. src_reg and imm32 fields are -explicit inputs to these instructions. - -For example: - - BPF_IND | BPF_W | BPF_LD means: - - R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) - and R1 - R5 were scratched. - -Unlike classic BPF instruction set, eBPF has generic load/store operations: - -BPF_MEM | | BPF_STX: *(size *) (dst_reg + off) = src_reg -BPF_MEM | | BPF_ST: *(size *) (dst_reg + off) = imm32 -BPF_MEM | | BPF_LDX: dst_reg = *(size *) (src_reg + off) -BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg -BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg - -Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and -2 byte atomic increments are not supported. - -eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists -of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single -instruction that loads 64-bit immediate value into a dst_reg. -Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads -32-bit immediate value into a register. - -eBPF verifier -------------- -The safety of the eBPF program is determined in two steps. - -First step does DAG check to disallow loops and other CFG validation. -In particular it will detect programs that have unreachable instructions. -(though classic BPF checker allows them) - -Second step starts from the first insn and descends all possible paths. -It simulates execution of every insn and observes the state change of -registers and stack. - -At the start of the program the register R1 contains a pointer to context -and has type PTR_TO_CTX. -If verifier sees an insn that does R2=R1, then R2 has now type -PTR_TO_CTX as well and can be used on the right hand side of expression. -If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, -since addition of two valid pointers makes invalid pointer. -(In 'secure' mode verifier will reject any type of pointer arithmetic to make -sure that kernel addresses don't leak to unprivileged users) - -If register was never written to, it's not readable: - bpf_mov R0 = R2 - bpf_exit -will be rejected, since R2 is unreadable at the start of the program. - -After kernel function call, R1-R5 are reset to unreadable and -R0 has a return type of the function. - -Since R6-R9 are callee saved, their state is preserved across the call. - bpf_mov R6 = 1 - bpf_call foo - bpf_mov R0 = R6 - bpf_exit -is a correct program. If there was R1 instead of R6, it would have -been rejected. - -load/store instructions are allowed only with registers of valid types, which -are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. -For example: - bpf_mov R1 = 1 - bpf_mov R2 = 2 - bpf_xadd *(u32 *)(R1 + 3) += R2 - bpf_exit -will be rejected, since R1 doesn't have a valid pointer type at the time of -execution of instruction bpf_xadd. - -At the start R1 type is PTR_TO_CTX (a pointer to generic 'struct bpf_context') -A callback is used to customize verifier to restrict eBPF program access to only -certain fields within ctx structure with specified size and alignment. - -For example, the following insn: - bpf_ld R0 = *(u32 *)(R6 + 8) -intends to load a word from address R6 + 8 and store it into R0 -If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know -that offset 8 of size 4 bytes can be accessed for reading, otherwise -the verifier will reject the program. -If R6=PTR_TO_STACK, then access should be aligned and be within -stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, -so it will fail verification, since it's out of bounds. - -The verifier will allow eBPF program to read data from stack only after -it wrote into it. -Classic BPF verifier does similar check with M[0-15] memory slots. -For example: - bpf_ld R0 = *(u32 *)(R10 - 4) - bpf_exit -is invalid program. -Though R10 is correct read-only register and has type PTR_TO_STACK -and R10 - 4 is within stack bounds, there were no stores into that location. - -Pointer register spill/fill is tracked as well, since four (R6-R9) -callee saved registers may not be enough for some programs. - -Allowed function calls are customized with bpf_verifier_ops->get_func_proto() -The eBPF verifier will check that registers match argument constraints. -After the call register R0 will be set to return type of the function. - -Function calls is a main mechanism to extend functionality of eBPF programs. -Socket filters may let programs to call one set of functions, whereas tracing -filters may allow completely different set. - -If a function made accessible to eBPF program, it needs to be thought through -from safety point of view. The verifier will guarantee that the function is -called with valid arguments. - -seccomp vs socket filters have different security restrictions for classic BPF. -Seccomp solves this by two stage verifier: classic BPF verifier is followed -by seccomp verifier. In case of eBPF one configurable verifier is shared for -all use cases. - -See details of eBPF verifier in kernel/bpf/verifier.c - -Register value tracking ------------------------ -In order to determine the safety of an eBPF program, the verifier must track -the range of possible values in each register and also in each stack slot. -This is done with 'struct bpf_reg_state', defined in include/linux/ -bpf_verifier.h, which unifies tracking of scalar and pointer values. Each -register state has a type, which is either NOT_INIT (the register has not been -written to), SCALAR_VALUE (some value which is not usable as a pointer), or a -pointer type. The types of pointers describe their base, as follows: - PTR_TO_CTX Pointer to bpf_context. - CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic - on these pointers is forbidden. - PTR_TO_MAP_VALUE Pointer to the value stored in a map element. - PTR_TO_MAP_VALUE_OR_NULL - Either a pointer to a map value, or NULL; map accesses - (see section 'eBPF maps', below) return this type, - which becomes a PTR_TO_MAP_VALUE when checked != NULL. - Arithmetic on these pointers is forbidden. - PTR_TO_STACK Frame pointer. - PTR_TO_PACKET skb->data. - PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden. - PTR_TO_SOCKET Pointer to struct bpf_sock_ops, implicitly refcounted. - PTR_TO_SOCKET_OR_NULL - Either a pointer to a socket, or NULL; socket lookup - returns this type, which becomes a PTR_TO_SOCKET when - checked != NULL. PTR_TO_SOCKET is reference-counted, - so programs must release the reference through the - socket release function before the end of the program. - Arithmetic on these pointers is forbidden. -However, a pointer may be offset from this base (as a result of pointer -arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable -offset'. The former is used when an exactly-known value (e.g. an immediate -operand) is added to a pointer, while the latter is used for values which are -not exactly known. The variable offset is also used in SCALAR_VALUEs, to track -the range of possible values in the register. -The verifier's knowledge about the variable offset consists of: -* minimum and maximum values as unsigned -* minimum and maximum values as signed -* knowledge of the values of individual bits, in the form of a 'tnum': a u64 -'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; -1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both -mask and value; no bit should ever be 1 in both. For example, if a byte is read -into a register from memory, the register's top 56 bits are known zero, while -the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we -then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; -0x1ff), because of potential carries. - -Besides arithmetic, the register state can also be updated by conditional -branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch -it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' -branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or -BPF_JSGE) would instead update the signed minimum/maximum values. Information -from the signed and unsigned bounds can be combined; for instance if a value is -first tested < 8 and then tested s> 4, the verifier will conclude that the value -is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. - -PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all -pointers sharing that same variable offset. This is important for packet range -checks: after adding a variable to a packet pointer register A, if you then copy -it to another register B and then add a constant 4 to A, both registers will -share the same 'id' but the A will have a fixed offset of +4. Then if A is -bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is -now known to have a safe range of at least 4 bytes. See 'Direct packet access', -below, for more on PTR_TO_PACKET ranges. - -The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of -the pointer returned from a map lookup. This means that when one copy is -checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. -As well as range-checking, the tracked information is also used for enforcing -alignment of pointer accesses. For instance, on most systems the packet pointer -is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump -over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting -pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 -bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through -that pointer are safe. -The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common -to all copies of the pointer returned from a socket lookup. This has similar -behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but -it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly -represents a reference to the corresponding 'struct sock'. To ensure that the -reference is not leaked, it is imperative to NULL-check the reference and in -the non-NULL case, and pass the valid reference to the socket release function. - -Direct packet access --------------------- -In cls_bpf and act_bpf programs the verifier allows direct access to the packet -data via skb->data and skb->data_end pointers. -Ex: -1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ -2: r3 = *(u32 *)(r1 +76) /* load skb->data */ -3: r5 = r3 -4: r5 += 14 -5: if r5 > r4 goto pc+16 -R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp -6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ - -this 2byte load from the packet is safe to do, since the program author -did check 'if (skb->data + 14 > skb->data_end) goto err' at insn #5 which -means that in the fall-through case the register R3 (which points to skb->data) -has at least 14 directly accessible bytes. The verifier marks it -as R3=pkt(id=0,off=0,r=14). -id=0 means that no additional variables were added to the register. -off=0 means that no additional constants were added. -r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. -Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points -to the packet data, but constant 14 was added to the register, so -it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14) -which is zero bytes. - -More complex packet access may look like: - R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp - 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ - 7: r4 = *(u8 *)(r3 +12) - 8: r4 *= 14 - 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ -10: r3 += r4 -11: r2 = r1 -12: r2 <<= 48 -13: r2 >>= 48 -14: r3 += r2 -15: r2 = r3 -16: r2 += 8 -17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ -18: if r2 > r1 goto pc+2 - R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp -19: r1 = *(u8 *)(r3 +4) -The state of the register R3 is R3=pkt(id=2,off=0,r=8) -id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some -offset within a packet and since the program author did -'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8). -The verifier only allows 'add'/'sub' operations on packet registers. Any other -operation will set the register state to 'SCALAR_VALUE' and it won't be -available for direct packet access. -Operation 'r3 += rX' may overflow and become less than original skb->data, -therefore the verifier has to prevent that. So when it sees 'r3 += rX' -instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 -against skb->data_end will not give us 'range' information, so attempts to read -through the pointer will give "invalid access to packet" error. -Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is -R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits -of the register are guaranteed to be zero, and nothing is known about the lower -8 bits. After insn 'r4 *= 14' the state becomes -R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit -value by constant 14 will keep upper 52 bits as zero, also the least significant -bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make -R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign -extending. This logic is implemented in adjust_reg_min_max_vals() function, -which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice -versa) and adjust_scalar_min_max_vals() for operations on two scalars. - -The end result is that bpf program author can access packet directly -using normal C code as: - void *data = (void *)(long)skb->data; - void *data_end = (void *)(long)skb->data_end; - struct eth_hdr *eth = data; - struct iphdr *iph = data + sizeof(*eth); - struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); - - if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) - return 0; - if (eth->h_proto != htons(ETH_P_IP)) - return 0; - if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) - return 0; - if (udp->dest == 53 || udp->source == 9) - ...; -which makes such programs easier to write comparing to LD_ABS insn -and significantly faster. - -eBPF maps ---------- -'maps' is a generic storage of different types for sharing data between kernel -and userspace. - -The maps are accessed from user space via BPF syscall, which has commands: -- create a map with given type and attributes - map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size) - using attr->map_type, attr->key_size, attr->value_size, attr->max_entries - returns process-local file descriptor or negative error - -- lookup key in a given map - err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) - using attr->map_fd, attr->key, attr->value - returns zero and stores found elem into value or negative error - -- create or update key/value pair in a given map - err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) - using attr->map_fd, attr->key, attr->value - returns zero or negative error - -- find and delete element by key in a given map - err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size) - using attr->map_fd, attr->key - -- to delete map: close(fd) - Exiting process will delete maps automatically - -userspace programs use this syscall to create/access maps that eBPF programs -are concurrently updating. - -maps can have different types: hash, array, bloom filter, radix-tree, etc. - -The map is defined by: - . type - . max number of elements - . key size in bytes - . value size in bytes - -Pruning -------- -The verifier does not actually walk all possible paths through the program. For -each new branch to analyse, the verifier looks at all the states it's previously -been in when at this instruction. If any of them contain the current state as a -subset, the branch is 'pruned' - that is, the fact that the previous state was -accepted implies the current state would be as well. For instance, if in the -previous state, r1 held a packet-pointer, and in the current state, r1 holds a -packet-pointer with a range as long or longer and at least as strict an -alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't -have been used by any path from that point, so any value in r2 (including -another NOT_INIT) is safe. The implementation is in the function regsafe(). -Pruning considers not only the registers but also the stack (and any spilled -registers it may hold). They must all be safe for the branch to be pruned. -This is implemented in states_equal(). - -Understanding eBPF verifier messages ------------------------------------- - -The following are few examples of invalid eBPF programs and verifier error -messages as seen in the log: - -Program with unreachable instructions: -static struct bpf_insn prog[] = { - BPF_EXIT_INSN(), - BPF_EXIT_INSN(), -}; -Error: - unreachable insn 1 - -Program that reads uninitialized register: - BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), - BPF_EXIT_INSN(), -Error: - 0: (bf) r0 = r2 - R2 !read_ok - -Program that doesn't initialize R0 before exiting: - BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), - BPF_EXIT_INSN(), -Error: - 0: (bf) r2 = r1 - 1: (95) exit - R0 !read_ok - -Program that accesses stack out of bounds: - BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 +8) = 0 - invalid stack off=8 size=8 - -Program that doesn't initialize stack before passing its address into function: - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_EXIT_INSN(), -Error: - 0: (bf) r2 = r10 - 1: (07) r2 += -8 - 2: (b7) r1 = 0x0 - 3: (85) call 1 - invalid indirect read from stack off -8+0 size 8 - -Program that uses invalid map_fd=0 while calling to map_lookup_elem() function: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 0x0 - 4: (85) call 1 - fd 0 is not pointing to valid bpf_map - -Program that doesn't check return value of map_lookup_elem() before accessing -map element: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 0x0 - 4: (85) call 1 - 5: (7a) *(u64 *)(r0 +0) = 0 - R0 invalid mem access 'map_value_or_null' - -Program that correctly checks map_lookup_elem() returned value for NULL, but -accesses the memory with incorrect alignment: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 1 - 4: (85) call 1 - 5: (15) if r0 == 0x0 goto pc+1 - R0=map_ptr R10=fp - 6: (7a) *(u64 *)(r0 +4) = 0 - misaligned access off 4 size 8 - -Program that correctly checks map_lookup_elem() returned value for NULL and -accesses memory with correct alignment in one side of 'if' branch, but fails -to do so in the other side of 'if' branch: - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), - BPF_EXIT_INSN(), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), - BPF_EXIT_INSN(), -Error: - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 1 - 4: (85) call 1 - 5: (15) if r0 == 0x0 goto pc+2 - R0=map_ptr R10=fp - 6: (7a) *(u64 *)(r0 +0) = 0 - 7: (95) exit - - from 5 to 8: R0=imm0 R10=fp - 8: (7a) *(u64 *)(r0 +0) = 1 - R0 invalid mem access 'imm' - -Program that performs a socket lookup then sets the pointer to NULL without -checking it: -value: - BPF_MOV64_IMM(BPF_REG_2, 0), - BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_MOV64_IMM(BPF_REG_3, 4), - BPF_MOV64_IMM(BPF_REG_4, 0), - BPF_MOV64_IMM(BPF_REG_5, 0), - BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), - BPF_MOV64_IMM(BPF_REG_0, 0), - BPF_EXIT_INSN(), -Error: - 0: (b7) r2 = 0 - 1: (63) *(u32 *)(r10 -8) = r2 - 2: (bf) r2 = r10 - 3: (07) r2 += -8 - 4: (b7) r3 = 4 - 5: (b7) r4 = 0 - 6: (b7) r5 = 0 - 7: (85) call bpf_sk_lookup_tcp#65 - 8: (b7) r0 = 0 - 9: (95) exit - Unreleased reference id=1, alloc_insn=7 - -Program that performs a socket lookup but does not NULL-check the returned -value: - BPF_MOV64_IMM(BPF_REG_2, 0), - BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_MOV64_IMM(BPF_REG_3, 4), - BPF_MOV64_IMM(BPF_REG_4, 0), - BPF_MOV64_IMM(BPF_REG_5, 0), - BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), - BPF_EXIT_INSN(), -Error: - 0: (b7) r2 = 0 - 1: (63) *(u32 *)(r10 -8) = r2 - 2: (bf) r2 = r10 - 3: (07) r2 += -8 - 4: (b7) r3 = 4 - 5: (b7) r4 = 0 - 6: (b7) r5 = 0 - 7: (85) call bpf_sk_lookup_tcp#65 - 8: (95) exit - Unreleased reference id=1, alloc_insn=7 - -Testing -------- - -Next to the BPF toolchain, the kernel also ships a test module that contains -various test cases for classic and internal BPF that can be executed against -the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and -enabled via Kconfig: - - CONFIG_TEST_BPF=m - -After the module has been built and installed, the test suite can be executed -via insmod or modprobe against 'test_bpf' module. Results of the test cases -including timings in nsec can be found in the kernel log (dmesg). - -Misc ----- - -Also trinity, the Linux syscall fuzzer, has built-in support for BPF and -SECCOMP-BPF kernel fuzzing. - -Written by ----------- - -The document was written in the hope that it is found useful and in order -to give potential BPF hackers or security auditors a better overview of -the underlying architecture. - -Jay Schulist -Daniel Borkmann -Alexei Starovoitov diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 807abe25ae4b..144ed838c1a9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -56,6 +56,7 @@ Contents: driver eql fib_trie + filter .. only:: subproject and html diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 999eb41da81d..494614573c67 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt @@ -1051,7 +1051,7 @@ for more information on hardware timestamps. ------------------------------------------------------------------------------- - Packet sockets work well together with Linux socket filters, thus you also - might want to have a look at Documentation/networking/filter.txt + might want to have a look at Documentation/networking/filter.rst -------------------------------------------------------------------------------- + THANKS diff --git a/MAINTAINERS b/MAINTAINERS index 7323bfc1720f..4ec6d2741d36 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3192,7 +3192,7 @@ Q: https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147 T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git F: Documentation/bpf/ -F: Documentation/networking/filter.txt +F: Documentation/networking/filter.rst F: arch/*/net/* F: include/linux/bpf* F: include/linux/filter.h diff --git a/tools/bpf/bpf_asm.c b/tools/bpf/bpf_asm.c index e5f95e3eede3..0063c3c029e7 100644 --- a/tools/bpf/bpf_asm.c +++ b/tools/bpf/bpf_asm.c @@ -11,7 +11,7 @@ * * How to get into it: * - * 1) read Documentation/networking/filter.txt + * 1) read Documentation/networking/filter.rst * 2) Run `bpf_asm [-c] ` to translate into binary * blob that is loadable with xt_bpf, cls_bpf et al. Note: -c will * pretty print a C-like construct. diff --git a/tools/bpf/bpf_dbg.c b/tools/bpf/bpf_dbg.c index 9d3766e653a9..a0ebcdf59c31 100644 --- a/tools/bpf/bpf_dbg.c +++ b/tools/bpf/bpf_dbg.c @@ -13,7 +13,7 @@ * for making a verdict when multiple simple BPF programs are combined * into one in order to prevent parsing same headers multiple times. * - * More on how to debug BPF opcodes see Documentation/networking/filter.txt + * More on how to debug BPF opcodes see Documentation/networking/filter.rst * which is the main document on BPF. Mini howto for getting started: * * 1) `./bpf_dbg` to enter the shell (shell cmds denoted with '>'): -- cgit v1.2.3 From 62502dff2c5012c19727bd992b0101a816095f1e Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:37 +0200 Subject: docs: networking: convert fore200e.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/fore200e.rst | 66 +++++++++++++++++++++++++++++++++++ Documentation/networking/fore200e.txt | 64 --------------------------------- Documentation/networking/index.rst | 1 + drivers/atm/Kconfig | 2 +- 4 files changed, 68 insertions(+), 65 deletions(-) create mode 100644 Documentation/networking/fore200e.rst delete mode 100644 Documentation/networking/fore200e.txt (limited to 'Documentation') diff --git a/Documentation/networking/fore200e.rst b/Documentation/networking/fore200e.rst new file mode 100644 index 000000000000..55df9ec09ac8 --- /dev/null +++ b/Documentation/networking/fore200e.rst @@ -0,0 +1,66 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================= +FORE Systems PCA-200E/SBA-200E ATM NIC driver +============================================= + +This driver adds support for the FORE Systems 200E-series ATM adapters +to the Linux operating system. It is based on the earlier PCA-200E driver +written by Uwe Dannowski. + +The driver simultaneously supports PCA-200E and SBA-200E adapters on +i386, alpha (untested), powerpc, sparc and sparc64 archs. + +The intent is to enable the use of different models of FORE adapters at the +same time, by hosts that have several bus interfaces (such as PCI+SBUS, +or PCI+EISA). + +Only PCI and SBUS devices are currently supported by the driver, but support +for other bus interfaces such as EISA should not be too hard to add. + + +Firmware Copyright Notice +------------------------- + +Please read the fore200e_firmware_copyright file present +in the linux/drivers/atm directory for details and restrictions. + + +Firmware Updates +---------------- + +The FORE Systems 200E-series driver is shipped with firmware data being +uploaded to the ATM adapters at system boot time or at module loading time. +The supplied firmware images should work with all adapters. + +However, if you encounter problems (the firmware doesn't start or the driver +is unable to read the PROM data), you may consider trying another firmware +version. Alternative binary firmware images can be found somewhere on the +ForeThought CD-ROM supplied with your adapter by FORE Systems. + +You can also get the latest firmware images from FORE Systems at +https://en.wikipedia.org/wiki/FORE_Systems. Register TACTics Online and go to +the 'software updates' pages. The firmware binaries are part of +the various ForeThought software distributions. + +Notice that different versions of the PCA-200E firmware exist, depending +on the endianness of the host architecture. The driver is shipped with +both little and big endian PCA firmware images. + +Name and location of the new firmware images can be set at kernel +configuration time: + +1. Copy the new firmware binary files (with .bin, .bin1 or .bin2 suffix) + to some directory, such as linux/drivers/atm. + +2. Reconfigure your kernel to set the new firmware name and location. + Expected pathnames are absolute or relative to the drivers/atm directory. + +3. Rebuild and re-install your kernel or your module. + + +Feedback +-------- + +Feedback is welcome. Please send success stories/bug reports/ +patches/improvement/comments/flames to . diff --git a/Documentation/networking/fore200e.txt b/Documentation/networking/fore200e.txt deleted file mode 100644 index 1f98f62b4370..000000000000 --- a/Documentation/networking/fore200e.txt +++ /dev/null @@ -1,64 +0,0 @@ - -FORE Systems PCA-200E/SBA-200E ATM NIC driver ---------------------------------------------- - -This driver adds support for the FORE Systems 200E-series ATM adapters -to the Linux operating system. It is based on the earlier PCA-200E driver -written by Uwe Dannowski. - -The driver simultaneously supports PCA-200E and SBA-200E adapters on -i386, alpha (untested), powerpc, sparc and sparc64 archs. - -The intent is to enable the use of different models of FORE adapters at the -same time, by hosts that have several bus interfaces (such as PCI+SBUS, -or PCI+EISA). - -Only PCI and SBUS devices are currently supported by the driver, but support -for other bus interfaces such as EISA should not be too hard to add. - - -Firmware Copyright Notice -------------------------- - -Please read the fore200e_firmware_copyright file present -in the linux/drivers/atm directory for details and restrictions. - - -Firmware Updates ----------------- - -The FORE Systems 200E-series driver is shipped with firmware data being -uploaded to the ATM adapters at system boot time or at module loading time. -The supplied firmware images should work with all adapters. - -However, if you encounter problems (the firmware doesn't start or the driver -is unable to read the PROM data), you may consider trying another firmware -version. Alternative binary firmware images can be found somewhere on the -ForeThought CD-ROM supplied with your adapter by FORE Systems. - -You can also get the latest firmware images from FORE Systems at -https://en.wikipedia.org/wiki/FORE_Systems. Register TACTics Online and go to -the 'software updates' pages. The firmware binaries are part of -the various ForeThought software distributions. - -Notice that different versions of the PCA-200E firmware exist, depending -on the endianness of the host architecture. The driver is shipped with -both little and big endian PCA firmware images. - -Name and location of the new firmware images can be set at kernel -configuration time: - -1. Copy the new firmware binary files (with .bin, .bin1 or .bin2 suffix) - to some directory, such as linux/drivers/atm. - -2. Reconfigure your kernel to set the new firmware name and location. - Expected pathnames are absolute or relative to the drivers/atm directory. - -3. Rebuild and re-install your kernel or your module. - - -Feedback --------- - -Feedback is welcome. Please send success stories/bug reports/ -patches/improvement/comments/flames to . diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 144ed838c1a9..b2fb8b907d68 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -57,6 +57,7 @@ Contents: eql fib_trie filter + fore200e .. only:: subproject and html diff --git a/drivers/atm/Kconfig b/drivers/atm/Kconfig index 8c37294f1d1e..4af7cbdcc349 100644 --- a/drivers/atm/Kconfig +++ b/drivers/atm/Kconfig @@ -336,7 +336,7 @@ config ATM_FORE200E on PCI and SBUS hosts. Say Y (or M to compile as a module named fore_200e) here if you have one of these ATM adapters. - See the file for + See the file for further details. config ATM_FORE200E_USE_TASKLET -- cgit v1.2.3 From 5b0d74b54c7f1cb9c65955df78dffe112e1959c1 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:38 +0200 Subject: docs: networking: convert framerelay.txt to ReST - add SPDX header; - add a document title; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/framerelay.rst | 44 +++++++++++++++++++++++++++++++++ Documentation/networking/framerelay.txt | 39 ----------------------------- Documentation/networking/index.rst | 1 + drivers/net/wan/Kconfig | 4 +-- 4 files changed, 47 insertions(+), 41 deletions(-) create mode 100644 Documentation/networking/framerelay.rst delete mode 100644 Documentation/networking/framerelay.txt (limited to 'Documentation') diff --git a/Documentation/networking/framerelay.rst b/Documentation/networking/framerelay.rst new file mode 100644 index 000000000000..6d904399ec6d --- /dev/null +++ b/Documentation/networking/framerelay.rst @@ -0,0 +1,44 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +Frame Relay (FR) +================ + +Frame Relay (FR) support for linux is built into a two tiered system of device +drivers. The upper layer implements RFC1490 FR specification, and uses the +Data Link Connection Identifier (DLCI) as its hardware address. Usually these +are assigned by your network supplier, they give you the number/numbers of +the Virtual Connections (VC) assigned to you. + +Each DLCI is a point-to-point link between your machine and a remote one. +As such, a separate device is needed to accommodate the routing. Within the +net-tools archives is 'dlcicfg'. This program will communicate with the +base "DLCI" device, and create new net devices named 'dlci00', 'dlci01'... +The configuration script will ask you how many DLCIs you need, as well as +how many DLCIs you want to assign to each Frame Relay Access Device (FRAD). + +The DLCI uses a number of function calls to communicate with the FRAD, all +of which are stored in the FRAD's private data area. assoc/deassoc, +activate/deactivate and dlci_config. The DLCI supplies a receive function +to the FRAD to accept incoming packets. + +With this initial offering, only 1 FRAD driver is available. With many thanks +to Sangoma Technologies, David Mandelstam & Gene Kozin, the S502A, S502E & +S508 are supported. This driver is currently set up for only FR, but as +Sangoma makes more firmware modules available, it can be updated to provide +them as well. + +Configuration of the FRAD makes use of another net-tools program, 'fradcfg'. +This program makes use of a configuration file (which dlcicfg can also read) +to specify the types of boards to be configured as FRADs, as well as perform +any board specific configuration. The Sangoma module of fradcfg loads the +FR firmware into the card, sets the irq/port/memory information, and provides +an initial configuration. + +Additional FRAD device drivers can be added as hardware is available. + +At this time, the dlcicfg and fradcfg programs have not been incorporated into +the net-tools distribution. They can be found at ftp.invlogic.com, in +/pub/linux. Note that with OS/2 FTPD, you end up in /pub by default, so just +use 'cd linux'. v0.10 is for use on pre-2.0.3 and earlier, v0.15 is for +pre-2.0.4 and later. diff --git a/Documentation/networking/framerelay.txt b/Documentation/networking/framerelay.txt deleted file mode 100644 index 1a0b720440dd..000000000000 --- a/Documentation/networking/framerelay.txt +++ /dev/null @@ -1,39 +0,0 @@ -Frame Relay (FR) support for linux is built into a two tiered system of device -drivers. The upper layer implements RFC1490 FR specification, and uses the -Data Link Connection Identifier (DLCI) as its hardware address. Usually these -are assigned by your network supplier, they give you the number/numbers of -the Virtual Connections (VC) assigned to you. - -Each DLCI is a point-to-point link between your machine and a remote one. -As such, a separate device is needed to accommodate the routing. Within the -net-tools archives is 'dlcicfg'. This program will communicate with the -base "DLCI" device, and create new net devices named 'dlci00', 'dlci01'... -The configuration script will ask you how many DLCIs you need, as well as -how many DLCIs you want to assign to each Frame Relay Access Device (FRAD). - -The DLCI uses a number of function calls to communicate with the FRAD, all -of which are stored in the FRAD's private data area. assoc/deassoc, -activate/deactivate and dlci_config. The DLCI supplies a receive function -to the FRAD to accept incoming packets. - -With this initial offering, only 1 FRAD driver is available. With many thanks -to Sangoma Technologies, David Mandelstam & Gene Kozin, the S502A, S502E & -S508 are supported. This driver is currently set up for only FR, but as -Sangoma makes more firmware modules available, it can be updated to provide -them as well. - -Configuration of the FRAD makes use of another net-tools program, 'fradcfg'. -This program makes use of a configuration file (which dlcicfg can also read) -to specify the types of boards to be configured as FRADs, as well as perform -any board specific configuration. The Sangoma module of fradcfg loads the -FR firmware into the card, sets the irq/port/memory information, and provides -an initial configuration. - -Additional FRAD device drivers can be added as hardware is available. - -At this time, the dlcicfg and fradcfg programs have not been incorporated into -the net-tools distribution. They can be found at ftp.invlogic.com, in -/pub/linux. Note that with OS/2 FTPD, you end up in /pub by default, so just -use 'cd linux'. v0.10 is for use on pre-2.0.3 and earlier, v0.15 is for -pre-2.0.4 and later. - diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b2fb8b907d68..4e225f1f7039 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -58,6 +58,7 @@ Contents: fib_trie filter fore200e + framerelay .. only:: subproject and html diff --git a/drivers/net/wan/Kconfig b/drivers/net/wan/Kconfig index dbc0e3f7a3e2..3e21726c36e8 100644 --- a/drivers/net/wan/Kconfig +++ b/drivers/net/wan/Kconfig @@ -336,7 +336,7 @@ config DLCI To use frame relay, you need supporting hardware (called FRAD) and certain programs from the net-tools package as explained in - . + . To compile this driver as a module, choose M here: the module will be called dlci. @@ -361,7 +361,7 @@ config SDLA These are multi-protocol cards, but only Frame Relay is supported by the driver at this time. Please read - . + . To compile this driver as a module, choose M here: the module will be called sdla. -- cgit v1.2.3 From 16128ad8f927850a1121b7645c6381341d9c0b63 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:39 +0200 Subject: docs: networking: convert generic-hdlc.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/generic-hdlc.rst | 170 ++++++++++++++++++++++++++++++ Documentation/networking/generic-hdlc.txt | 132 ----------------------- Documentation/networking/index.rst | 1 + 3 files changed, 171 insertions(+), 132 deletions(-) create mode 100644 Documentation/networking/generic-hdlc.rst delete mode 100644 Documentation/networking/generic-hdlc.txt (limited to 'Documentation') diff --git a/Documentation/networking/generic-hdlc.rst b/Documentation/networking/generic-hdlc.rst new file mode 100644 index 000000000000..1c3bb5cb98d4 --- /dev/null +++ b/Documentation/networking/generic-hdlc.rst @@ -0,0 +1,170 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +Generic HDLC layer +================== + +Krzysztof Halasa + + +Generic HDLC layer currently supports: + +1. Frame Relay (ANSI, CCITT, Cisco and no LMI) + + - Normal (routed) and Ethernet-bridged (Ethernet device emulation) + interfaces can share a single PVC. + - ARP support (no InARP support in the kernel - there is an + experimental InARP user-space daemon available on: + http://www.kernel.org/pub/linux/utils/net/hdlc/). + +2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation +3. Cisco HDLC +4. PPP +5. X.25 (uses X.25 routines). + +Generic HDLC is a protocol driver only - it needs a low-level driver +for your particular hardware. + +Ethernet device emulation (using HDLC or Frame-Relay PVC) is compatible +with IEEE 802.1Q (VLANs) and 802.1D (Ethernet bridging). + + +Make sure the hdlc.o and the hardware driver are loaded. It should +create a number of "hdlc" (hdlc0 etc) network devices, one for each +WAN port. You'll need the "sethdlc" utility, get it from: + + http://www.kernel.org/pub/linux/utils/net/hdlc/ + +Compile sethdlc.c utility:: + + gcc -O2 -Wall -o sethdlc sethdlc.c + +Make sure you're using a correct version of sethdlc for your kernel. + +Use sethdlc to set physical interface, clock rate, HDLC mode used, +and add any required PVCs if using Frame Relay. +Usually you want something like:: + + sethdlc hdlc0 clock int rate 128000 + sethdlc hdlc0 cisco interval 10 timeout 25 + +or:: + + sethdlc hdlc0 rs232 clock ext + sethdlc hdlc0 fr lmi ansi + sethdlc hdlc0 create 99 + ifconfig hdlc0 up + ifconfig pvc0 localIP pointopoint remoteIP + +In Frame Relay mode, ifconfig master hdlc device up (without assigning +any IP address to it) before using pvc devices. + + +Setting interface: + +* v35 | rs232 | x21 | t1 | e1 + - sets physical interface for a given port + if the card has software-selectable interfaces + loopback + - activate hardware loopback (for testing only) +* clock ext + - both RX clock and TX clock external +* clock int + - both RX clock and TX clock internal +* clock txint + - RX clock external, TX clock internal +* clock txfromrx + - RX clock external, TX clock derived from RX clock +* rate + - sets clock rate in bps (for "int" or "txint" clock only) + + +Setting protocol: + +* hdlc - sets raw HDLC (IP-only) mode + + nrz / nrzi / fm-mark / fm-space / manchester - sets transmission code + + no-parity / crc16 / crc16-pr0 (CRC16 with preset zeros) / crc32-itu + + crc16-itu (CRC16 with ITU-T polynomial) / crc16-itu-pr0 - sets parity + +* hdlc-eth - Ethernet device emulation using HDLC. Parity and encoding + as above. + +* cisco - sets Cisco HDLC mode (IP, IPv6 and IPX supported) + + interval - time in seconds between keepalive packets + + timeout - time in seconds after last received keepalive packet before + we assume the link is down + +* ppp - sets synchronous PPP mode + +* x25 - sets X.25 mode + +* fr - Frame Relay mode + + lmi ansi / ccitt / cisco / none - LMI (link management) type + + dce - Frame Relay DCE (network) side LMI instead of default DTE (user). + + It has nothing to do with clocks! + + - t391 - link integrity verification polling timer (in seconds) - user + - t392 - polling verification timer (in seconds) - network + - n391 - full status polling counter - user + - n392 - error threshold - both user and network + - n393 - monitored events count - both user and network + +Frame-Relay only: + +* create n | delete n - adds / deletes PVC interface with DLCI #n. + Newly created interface will be named pvc0, pvc1 etc. + +* create ether n | delete ether n - adds a device for Ethernet-bridged + frames. The device will be named pvceth0, pvceth1 etc. + + + + +Board-specific issues +--------------------- + +n2.o and c101.o need parameters to work:: + + insmod n2 hw=io,irq,ram,ports[:io,irq,...] + +example:: + + insmod n2 hw=0x300,10,0xD0000,01 + +or:: + + insmod c101 hw=irq,ram[:irq,...] + +example:: + + insmod c101 hw=9,0xdc000 + +If built into the kernel, these drivers need kernel (command line) parameters:: + + n2.hw=io,irq,ram,ports:... + +or:: + + c101.hw=irq,ram:... + + + +If you have a problem with N2, C101 or PLX200SYN card, you can issue the +"private" command to see port's packet descriptor rings (in kernel logs):: + + sethdlc hdlc0 private + +The hardware driver has to be build with #define DEBUG_RINGS. +Attaching this info to bug reports would be helpful. Anyway, let me know +if you have problems using this. + +For patches and other info look at: +. diff --git a/Documentation/networking/generic-hdlc.txt b/Documentation/networking/generic-hdlc.txt deleted file mode 100644 index 4eb3cc40b702..000000000000 --- a/Documentation/networking/generic-hdlc.txt +++ /dev/null @@ -1,132 +0,0 @@ -Generic HDLC layer -Krzysztof Halasa - - -Generic HDLC layer currently supports: -1. Frame Relay (ANSI, CCITT, Cisco and no LMI) - - Normal (routed) and Ethernet-bridged (Ethernet device emulation) - interfaces can share a single PVC. - - ARP support (no InARP support in the kernel - there is an - experimental InARP user-space daemon available on: - http://www.kernel.org/pub/linux/utils/net/hdlc/). -2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation -3. Cisco HDLC -4. PPP -5. X.25 (uses X.25 routines). - -Generic HDLC is a protocol driver only - it needs a low-level driver -for your particular hardware. - -Ethernet device emulation (using HDLC or Frame-Relay PVC) is compatible -with IEEE 802.1Q (VLANs) and 802.1D (Ethernet bridging). - - -Make sure the hdlc.o and the hardware driver are loaded. It should -create a number of "hdlc" (hdlc0 etc) network devices, one for each -WAN port. You'll need the "sethdlc" utility, get it from: - http://www.kernel.org/pub/linux/utils/net/hdlc/ - -Compile sethdlc.c utility: - gcc -O2 -Wall -o sethdlc sethdlc.c -Make sure you're using a correct version of sethdlc for your kernel. - -Use sethdlc to set physical interface, clock rate, HDLC mode used, -and add any required PVCs if using Frame Relay. -Usually you want something like: - - sethdlc hdlc0 clock int rate 128000 - sethdlc hdlc0 cisco interval 10 timeout 25 -or - sethdlc hdlc0 rs232 clock ext - sethdlc hdlc0 fr lmi ansi - sethdlc hdlc0 create 99 - ifconfig hdlc0 up - ifconfig pvc0 localIP pointopoint remoteIP - -In Frame Relay mode, ifconfig master hdlc device up (without assigning -any IP address to it) before using pvc devices. - - -Setting interface: - -* v35 | rs232 | x21 | t1 | e1 - sets physical interface for a given port - if the card has software-selectable interfaces - loopback - activate hardware loopback (for testing only) -* clock ext - both RX clock and TX clock external -* clock int - both RX clock and TX clock internal -* clock txint - RX clock external, TX clock internal -* clock txfromrx - RX clock external, TX clock derived from RX clock -* rate - sets clock rate in bps (for "int" or "txint" clock only) - - -Setting protocol: - -* hdlc - sets raw HDLC (IP-only) mode - nrz / nrzi / fm-mark / fm-space / manchester - sets transmission code - no-parity / crc16 / crc16-pr0 (CRC16 with preset zeros) / crc32-itu - crc16-itu (CRC16 with ITU-T polynomial) / crc16-itu-pr0 - sets parity - -* hdlc-eth - Ethernet device emulation using HDLC. Parity and encoding - as above. - -* cisco - sets Cisco HDLC mode (IP, IPv6 and IPX supported) - interval - time in seconds between keepalive packets - timeout - time in seconds after last received keepalive packet before - we assume the link is down - -* ppp - sets synchronous PPP mode - -* x25 - sets X.25 mode - -* fr - Frame Relay mode - lmi ansi / ccitt / cisco / none - LMI (link management) type - dce - Frame Relay DCE (network) side LMI instead of default DTE (user). - It has nothing to do with clocks! - t391 - link integrity verification polling timer (in seconds) - user - t392 - polling verification timer (in seconds) - network - n391 - full status polling counter - user - n392 - error threshold - both user and network - n393 - monitored events count - both user and network - -Frame-Relay only: -* create n | delete n - adds / deletes PVC interface with DLCI #n. - Newly created interface will be named pvc0, pvc1 etc. - -* create ether n | delete ether n - adds a device for Ethernet-bridged - frames. The device will be named pvceth0, pvceth1 etc. - - - - -Board-specific issues ---------------------- - -n2.o and c101.o need parameters to work: - - insmod n2 hw=io,irq,ram,ports[:io,irq,...] -example: - insmod n2 hw=0x300,10,0xD0000,01 - -or - insmod c101 hw=irq,ram[:irq,...] -example: - insmod c101 hw=9,0xdc000 - -If built into the kernel, these drivers need kernel (command line) parameters: - n2.hw=io,irq,ram,ports:... -or - c101.hw=irq,ram:... - - - -If you have a problem with N2, C101 or PLX200SYN card, you can issue the -"private" command to see port's packet descriptor rings (in kernel logs): - - sethdlc hdlc0 private - -The hardware driver has to be build with #define DEBUG_RINGS. -Attaching this info to bug reports would be helpful. Anyway, let me know -if you have problems using this. - -For patches and other info look at: -. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 4e225f1f7039..d34824b27264 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -59,6 +59,7 @@ Contents: filter fore200e framerelay + generic-hdlc .. only:: subproject and html -- cgit v1.2.3 From 110662503de20f21ab22cf409753124d0977a339 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:40 +0200 Subject: docs: networking: convert generic_netlink.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/generic_netlink.rst | 9 +++++++++ Documentation/networking/generic_netlink.txt | 3 --- Documentation/networking/index.rst | 1 + 3 files changed, 10 insertions(+), 3 deletions(-) create mode 100644 Documentation/networking/generic_netlink.rst delete mode 100644 Documentation/networking/generic_netlink.txt (limited to 'Documentation') diff --git a/Documentation/networking/generic_netlink.rst b/Documentation/networking/generic_netlink.rst new file mode 100644 index 000000000000..59e04ccf80c1 --- /dev/null +++ b/Documentation/networking/generic_netlink.rst @@ -0,0 +1,9 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +Generic Netlink +=============== + +A wiki document on how to use Generic Netlink can be found here: + + * http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto diff --git a/Documentation/networking/generic_netlink.txt b/Documentation/networking/generic_netlink.txt deleted file mode 100644 index 3e071115ca90..000000000000 --- a/Documentation/networking/generic_netlink.txt +++ /dev/null @@ -1,3 +0,0 @@ -A wiki document on how to use Generic Netlink can be found here: - - * http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index d34824b27264..42e556509e22 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -60,6 +60,7 @@ Contents: fore200e framerelay generic-hdlc + generic_netlink .. only:: subproject and html -- cgit v1.2.3 From 8c498935585680284e5f3e5294d3c901b7c89d57 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:41 +0200 Subject: docs: networking: convert gen_stats.txt to ReST - add SPDX header; - mark code blocks and literals as such; - mark tables as such; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/gen_stats.rst | 129 +++++++++++++++++++++++++++++++++ Documentation/networking/gen_stats.txt | 119 ------------------------------ Documentation/networking/index.rst | 1 + net/core/gen_stats.c | 2 +- 4 files changed, 131 insertions(+), 120 deletions(-) create mode 100644 Documentation/networking/gen_stats.rst delete mode 100644 Documentation/networking/gen_stats.txt (limited to 'Documentation') diff --git a/Documentation/networking/gen_stats.rst b/Documentation/networking/gen_stats.rst new file mode 100644 index 000000000000..595a83b9a61b --- /dev/null +++ b/Documentation/networking/gen_stats.rst @@ -0,0 +1,129 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================== +Generic networking statistics for netlink users +=============================================== + +Statistic counters are grouped into structs: + +==================== ===================== ===================== +Struct TLV type Description +==================== ===================== ===================== +gnet_stats_basic TCA_STATS_BASIC Basic statistics +gnet_stats_rate_est TCA_STATS_RATE_EST Rate estimator +gnet_stats_queue TCA_STATS_QUEUE Queue statistics +none TCA_STATS_APP Application specific +==================== ===================== ===================== + + +Collecting: +----------- + +Declare the statistic structs you need:: + + struct mystruct { + struct gnet_stats_basic bstats; + struct gnet_stats_queue qstats; + ... + }; + +Update statistics, in dequeue() methods only, (while owning qdisc->running):: + + mystruct->tstats.packet++; + mystruct->qstats.backlog += skb->pkt_len; + + +Export to userspace (Dump): +--------------------------- + +:: + + my_dumping_routine(struct sk_buff *skb, ...) + { + struct gnet_dump dump; + + if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump, + TCA_PAD) < 0) + goto rtattr_failure; + + if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 || + gnet_stats_copy_queue(&dump, &mystruct->qstats) < 0 || + gnet_stats_copy_app(&dump, &xstats, sizeof(xstats)) < 0) + goto rtattr_failure; + + if (gnet_stats_finish_copy(&dump) < 0) + goto rtattr_failure; + ... + } + +TCA_STATS/TCA_XSTATS backward compatibility: +-------------------------------------------- + +Prior users of struct tc_stats and xstats can maintain backward +compatibility by calling the compat wrappers to keep providing the +existing TLV types:: + + my_dumping_routine(struct sk_buff *skb, ...) + { + if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, + TCA_XSTATS, &mystruct->lock, &dump, + TCA_PAD) < 0) + goto rtattr_failure; + ... + } + +A struct tc_stats will be filled out during gnet_stats_copy_* calls +and appended to the skb. TCA_XSTATS is provided if gnet_stats_copy_app +was called. + + +Locking: +-------- + +Locks are taken before writing and released once all statistics have +been written. Locks are always released in case of an error. You +are responsible for making sure that the lock is initialized. + + +Rate Estimator: +--------------- + +0) Prepare an estimator attribute. Most likely this would be in user + space. The value of this TLV should contain a tc_estimator structure. + As usual, such a TLV needs to be 32 bit aligned and therefore the + length needs to be appropriately set, etc. The estimator interval + and ewma log need to be converted to the appropriate values. + tc_estimator.c::tc_setup_estimator() is advisable to be used as the + conversion routine. It does a few clever things. It takes a time + interval in microsecs, a time constant also in microsecs and a struct + tc_estimator to be populated. The returned tc_estimator can be + transported to the kernel. Transfer such a structure in a TLV of type + TCA_RATE to your code in the kernel. + +In the kernel when setting up: + +1) make sure you have basic stats and rate stats setup first. +2) make sure you have initialized stats lock that is used to setup such + stats. +3) Now initialize a new estimator:: + + int ret = gen_new_estimator(my_basicstats,my_rate_est_stats, + mystats_lock, attr_with_tcestimator_struct); + + if ret == 0 + success + else + failed + +From now on, every time you dump my_rate_est_stats it will contain +up-to-date info. + +Once you are done, call gen_kill_estimator(my_basicstats, +my_rate_est_stats) Make sure that my_basicstats and my_rate_est_stats +are still valid (i.e still exist) at the time of making this call. + + +Authors: +-------- +- Thomas Graf +- Jamal Hadi Salim diff --git a/Documentation/networking/gen_stats.txt b/Documentation/networking/gen_stats.txt deleted file mode 100644 index 179b18ce45ff..000000000000 --- a/Documentation/networking/gen_stats.txt +++ /dev/null @@ -1,119 +0,0 @@ -Generic networking statistics for netlink users -====================================================================== - -Statistic counters are grouped into structs: - -Struct TLV type Description ----------------------------------------------------------------------- -gnet_stats_basic TCA_STATS_BASIC Basic statistics -gnet_stats_rate_est TCA_STATS_RATE_EST Rate estimator -gnet_stats_queue TCA_STATS_QUEUE Queue statistics -none TCA_STATS_APP Application specific - - -Collecting: ------------ - -Declare the statistic structs you need: -struct mystruct { - struct gnet_stats_basic bstats; - struct gnet_stats_queue qstats; - ... -}; - -Update statistics, in dequeue() methods only, (while owning qdisc->running) -mystruct->tstats.packet++; -mystruct->qstats.backlog += skb->pkt_len; - - -Export to userspace (Dump): ---------------------------- - -my_dumping_routine(struct sk_buff *skb, ...) -{ - struct gnet_dump dump; - - if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump, - TCA_PAD) < 0) - goto rtattr_failure; - - if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 || - gnet_stats_copy_queue(&dump, &mystruct->qstats) < 0 || - gnet_stats_copy_app(&dump, &xstats, sizeof(xstats)) < 0) - goto rtattr_failure; - - if (gnet_stats_finish_copy(&dump) < 0) - goto rtattr_failure; - ... -} - -TCA_STATS/TCA_XSTATS backward compatibility: --------------------------------------------- - -Prior users of struct tc_stats and xstats can maintain backward -compatibility by calling the compat wrappers to keep providing the -existing TLV types. - -my_dumping_routine(struct sk_buff *skb, ...) -{ - if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, - TCA_XSTATS, &mystruct->lock, &dump, - TCA_PAD) < 0) - goto rtattr_failure; - ... -} - -A struct tc_stats will be filled out during gnet_stats_copy_* calls -and appended to the skb. TCA_XSTATS is provided if gnet_stats_copy_app -was called. - - -Locking: --------- - -Locks are taken before writing and released once all statistics have -been written. Locks are always released in case of an error. You -are responsible for making sure that the lock is initialized. - - -Rate Estimator: --------------- - -0) Prepare an estimator attribute. Most likely this would be in user - space. The value of this TLV should contain a tc_estimator structure. - As usual, such a TLV needs to be 32 bit aligned and therefore the - length needs to be appropriately set, etc. The estimator interval - and ewma log need to be converted to the appropriate values. - tc_estimator.c::tc_setup_estimator() is advisable to be used as the - conversion routine. It does a few clever things. It takes a time - interval in microsecs, a time constant also in microsecs and a struct - tc_estimator to be populated. The returned tc_estimator can be - transported to the kernel. Transfer such a structure in a TLV of type - TCA_RATE to your code in the kernel. - -In the kernel when setting up: -1) make sure you have basic stats and rate stats setup first. -2) make sure you have initialized stats lock that is used to setup such - stats. -3) Now initialize a new estimator: - - int ret = gen_new_estimator(my_basicstats,my_rate_est_stats, - mystats_lock, attr_with_tcestimator_struct); - - if ret == 0 - success - else - failed - -From now on, every time you dump my_rate_est_stats it will contain -up-to-date info. - -Once you are done, call gen_kill_estimator(my_basicstats, -my_rate_est_stats) Make sure that my_basicstats and my_rate_est_stats -are still valid (i.e still exist) at the time of making this call. - - -Authors: --------- -Thomas Graf -Jamal Hadi Salim diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 42e556509e22..33afbb67f3fa 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -61,6 +61,7 @@ Contents: framerelay generic-hdlc generic_netlink + gen_stats .. only:: subproject and html diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c index 1d653fbfcf52..e491b083b348 100644 --- a/net/core/gen_stats.c +++ b/net/core/gen_stats.c @@ -6,7 +6,7 @@ * Jamal Hadi Salim * Alexey Kuznetsov, * - * See Documentation/networking/gen_stats.txt + * See Documentation/networking/gen_stats.rst */ #include -- cgit v1.2.3 From 81baecb6f6dc507f1b565e711b5193cdbb3fa939 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:42 +0200 Subject: docs: networking: convert gtp.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - add notes markups; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/gtp.rst | 251 +++++++++++++++++++++++++++++++++++++ Documentation/networking/gtp.txt | 230 --------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 252 insertions(+), 230 deletions(-) create mode 100644 Documentation/networking/gtp.rst delete mode 100644 Documentation/networking/gtp.txt (limited to 'Documentation') diff --git a/Documentation/networking/gtp.rst b/Documentation/networking/gtp.rst new file mode 100644 index 000000000000..1563fb94b289 --- /dev/null +++ b/Documentation/networking/gtp.rst @@ -0,0 +1,251 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +The Linux kernel GTP tunneling module +===================================== + +Documentation by + Harald Welte and + Andreas Schultz + +In 'drivers/net/gtp.c' you are finding a kernel-level implementation +of a GTP tunnel endpoint. + +What is GTP +=========== + +GTP is the Generic Tunnel Protocol, which is a 3GPP protocol used for +tunneling User-IP payload between a mobile station (phone, modem) +and the interconnection between an external packet data network (such +as the internet). + +So when you start a 'data connection' from your mobile phone, the +phone will use the control plane to signal for the establishment of +such a tunnel between that external data network and the phone. The +tunnel endpoints thus reside on the phone and in the gateway. All +intermediate nodes just transport the encapsulated packet. + +The phone itself does not implement GTP but uses some other +technology-dependent protocol stack for transmitting the user IP +payload, such as LLC/SNDCP/RLC/MAC. + +At some network element inside the cellular operator infrastructure +(SGSN in case of GPRS/EGPRS or classic UMTS, hNodeB in case of a 3G +femtocell, eNodeB in case of 4G/LTE), the cellular protocol stacking +is translated into GTP *without breaking the end-to-end tunnel*. So +intermediate nodes just perform some specific relay function. + +At some point the GTP packet ends up on the so-called GGSN (GSM/UMTS) +or P-GW (LTE), which terminates the tunnel, decapsulates the packet +and forwards it onto an external packet data network. This can be +public internet, but can also be any private IP network (or even +theoretically some non-IP network like X.25). + +You can find the protocol specification in 3GPP TS 29.060, available +publicly via the 3GPP website at http://www.3gpp.org/DynaReport/29060.htm + +A direct PDF link to v13.6.0 is provided for convenience below: +http://www.etsi.org/deliver/etsi_ts/129000_129099/129060/13.06.00_60/ts_129060v130600p.pdf + +The Linux GTP tunnelling module +=============================== + +The module implements the function of a tunnel endpoint, i.e. it is +able to decapsulate tunneled IP packets in the uplink originated by +the phone, and encapsulate raw IP packets received from the external +packet network in downlink towards the phone. + +It *only* implements the so-called 'user plane', carrying the User-IP +payload, called GTP-U. It does not implement the 'control plane', +which is a signaling protocol used for establishment and teardown of +GTP tunnels (GTP-C). + +So in order to have a working GGSN/P-GW setup, you will need a +userspace program that implements the GTP-C protocol and which then +uses the netlink interface provided by the GTP-U module in the kernel +to configure the kernel module. + +This split architecture follows the tunneling modules of other +protocols, e.g. PPPoE or L2TP, where you also run a userspace daemon +to handle the tunnel establishment, authentication etc. and only the +data plane is accelerated inside the kernel. + +Don't be confused by terminology: The GTP User Plane goes through +kernel accelerated path, while the GTP Control Plane goes to +Userspace :) + +The official homepage of the module is at +https://osmocom.org/projects/linux-kernel-gtp-u/wiki + +Userspace Programs with Linux Kernel GTP-U support +================================================== + +At the time of this writing, there are at least two Free Software +implementations that implement GTP-C and can use the netlink interface +to make use of the Linux kernel GTP-U support: + +* OpenGGSN (classic 2G/3G GGSN in C): + https://osmocom.org/projects/openggsn/wiki/OpenGGSN + +* ergw (GGSN + P-GW in Erlang): + https://github.com/travelping/ergw + +Userspace Library / Command Line Utilities +========================================== + +There is a userspace library called 'libgtpnl' which is based on +libmnl and which implements a C-language API towards the netlink +interface provided by the Kernel GTP module: + +http://git.osmocom.org/libgtpnl/ + +Protocol Versions +================= + +There are two different versions of GTP-U: v0 [GSM TS 09.60] and v1 +[3GPP TS 29.281]. Both are implemented in the Kernel GTP module. +Version 0 is a legacy version, and deprecated from recent 3GPP +specifications. + +GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2151 +for GTPv1-U and 3386 for GTPv0-U. + +There are three versions of GTP-C: v0, v1, and v2. As the kernel +doesn't implement GTP-C, we don't have to worry about this. It's the +responsibility of the control plane implementation in userspace to +implement that. + +IPv6 +==== + +The 3GPP specifications indicate either IPv4 or IPv6 can be used both +on the inner (user) IP layer, or on the outer (transport) layer. + +Unfortunately, the Kernel module currently supports IPv6 neither for +the User IP payload, nor for the outer IP layer. Patches or other +Contributions to fix this are most welcome! + +Mailing List +============ + +If you have questions regarding how to use the Kernel GTP module from +your own software, or want to contribute to the code, please use the +osmocom-net-grps mailing list for related discussion. The list can be +reached at osmocom-net-gprs@lists.osmocom.org and the mailman +interface for managing your subscription is at +https://lists.osmocom.org/mailman/listinfo/osmocom-net-gprs + +Issue Tracker +============= + +The Osmocom project maintains an issue tracker for the Kernel GTP-U +module at +https://osmocom.org/projects/linux-kernel-gtp-u/issues + +History / Acknowledgements +========================== + +The Module was originally created in 2012 by Harald Welte, but never +completed. Pablo came in to finish the mess Harald left behind. But +doe to a lack of user interest, it never got merged. + +In 2015, Andreas Schultz came to the rescue and fixed lots more bugs, +extended it with new features and finally pushed all of us to get it +mainline, where it was merged in 4.7.0. + +Architectural Details +===================== + +Local GTP-U entity and tunnel identification +-------------------------------------------- + +GTP-U uses UDP for transporting PDU's. The receiving UDP port is 2152 +for GTPv1-U and 3386 for GTPv0-U. + +There is only one GTP-U entity (and therefor SGSN/GGSN/S-GW/PDN-GW +instance) per IP address. Tunnel Endpoint Identifier (TEID) are unique +per GTP-U entity. + +A specific tunnel is only defined by the destination entity. Since the +destination port is constant, only the destination IP and TEID define +a tunnel. The source IP and Port have no meaning for the tunnel. + +Therefore: + + * when sending, the remote entity is defined by the remote IP and + the tunnel endpoint id. The source IP and port have no meaning and + can be changed at any time. + + * when receiving the local entity is defined by the local + destination IP and the tunnel endpoint id. The source IP and port + have no meaning and can change at any time. + +[3GPP TS 29.281] Section 4.3.0 defines this so:: + + The TEID in the GTP-U header is used to de-multiplex traffic + incoming from remote tunnel endpoints so that it is delivered to the + User plane entities in a way that allows multiplexing of different + users, different packet protocols and different QoS levels. + Therefore no two remote GTP-U endpoints shall send traffic to a + GTP-U protocol entity using the same TEID value except + for data forwarding as part of mobility procedures. + +The definition above only defines that two remote GTP-U endpoints +*should not* send to the same TEID, it *does not* forbid or exclude +such a scenario. In fact, the mentioned mobility procedures make it +necessary that the GTP-U entity accepts traffic for TEIDs from +multiple or unknown peers. + +Therefore, the receiving side identifies tunnels exclusively based on +TEIDs, not based on the source IP! + +APN vs. Network Device +====================== + +The GTP-U driver creates a Linux network device for each Gi/SGi +interface. + +[3GPP TS 29.281] calls the Gi/SGi reference point an interface. This +may lead to the impression that the GGSN/P-GW can have only one such +interface. + +Correct is that the Gi/SGi reference point defines the interworking +between +the 3GPP packet domain (PDN) based on GTP-U tunnel and IP +based networks. + +There is no provision in any of the 3GPP documents that limits the +number of Gi/SGi interfaces implemented by a GGSN/P-GW. + +[3GPP TS 29.061] Section 11.3 makes it clear that the selection of a +specific Gi/SGi interfaces is made through the Access Point Name +(APN):: + + 2. each private network manages its own addressing. In general this + will result in different private networks having overlapping + address ranges. A logically separate connection (e.g. an IP in IP + tunnel or layer 2 virtual circuit) is used between the GGSN/P-GW + and each private network. + + In this case the IP address alone is not necessarily unique. The + pair of values, Access Point Name (APN) and IPv4 address and/or + IPv6 prefixes, is unique. + +In order to support the overlapping address range use case, each APN +is mapped to a separate Gi/SGi interface (network device). + +.. note:: + + The Access Point Name is purely a control plane (GTP-C) concept. + At the GTP-U level, only Tunnel Endpoint Identifiers are present in + GTP-U packets and network devices are known + +Therefore for a given UE the mapping in IP to PDN network is: + + * network device + MS IP -> Peer IP + Peer TEID, + +and from PDN to IP network: + + * local GTP-U IP + TEID -> network device + +Furthermore, before a received T-PDU is injected into the network +device the MS IP is checked against the IP recorded in PDP context. diff --git a/Documentation/networking/gtp.txt b/Documentation/networking/gtp.txt deleted file mode 100644 index 6966bbec1ecb..000000000000 --- a/Documentation/networking/gtp.txt +++ /dev/null @@ -1,230 +0,0 @@ -The Linux kernel GTP tunneling module -====================================================================== -Documentation by Harald Welte and - Andreas Schultz - -In 'drivers/net/gtp.c' you are finding a kernel-level implementation -of a GTP tunnel endpoint. - -== What is GTP == - -GTP is the Generic Tunnel Protocol, which is a 3GPP protocol used for -tunneling User-IP payload between a mobile station (phone, modem) -and the interconnection between an external packet data network (such -as the internet). - -So when you start a 'data connection' from your mobile phone, the -phone will use the control plane to signal for the establishment of -such a tunnel between that external data network and the phone. The -tunnel endpoints thus reside on the phone and in the gateway. All -intermediate nodes just transport the encapsulated packet. - -The phone itself does not implement GTP but uses some other -technology-dependent protocol stack for transmitting the user IP -payload, such as LLC/SNDCP/RLC/MAC. - -At some network element inside the cellular operator infrastructure -(SGSN in case of GPRS/EGPRS or classic UMTS, hNodeB in case of a 3G -femtocell, eNodeB in case of 4G/LTE), the cellular protocol stacking -is translated into GTP *without breaking the end-to-end tunnel*. So -intermediate nodes just perform some specific relay function. - -At some point the GTP packet ends up on the so-called GGSN (GSM/UMTS) -or P-GW (LTE), which terminates the tunnel, decapsulates the packet -and forwards it onto an external packet data network. This can be -public internet, but can also be any private IP network (or even -theoretically some non-IP network like X.25). - -You can find the protocol specification in 3GPP TS 29.060, available -publicly via the 3GPP website at http://www.3gpp.org/DynaReport/29060.htm - -A direct PDF link to v13.6.0 is provided for convenience below: -http://www.etsi.org/deliver/etsi_ts/129000_129099/129060/13.06.00_60/ts_129060v130600p.pdf - -== The Linux GTP tunnelling module == - -The module implements the function of a tunnel endpoint, i.e. it is -able to decapsulate tunneled IP packets in the uplink originated by -the phone, and encapsulate raw IP packets received from the external -packet network in downlink towards the phone. - -It *only* implements the so-called 'user plane', carrying the User-IP -payload, called GTP-U. It does not implement the 'control plane', -which is a signaling protocol used for establishment and teardown of -GTP tunnels (GTP-C). - -So in order to have a working GGSN/P-GW setup, you will need a -userspace program that implements the GTP-C protocol and which then -uses the netlink interface provided by the GTP-U module in the kernel -to configure the kernel module. - -This split architecture follows the tunneling modules of other -protocols, e.g. PPPoE or L2TP, where you also run a userspace daemon -to handle the tunnel establishment, authentication etc. and only the -data plane is accelerated inside the kernel. - -Don't be confused by terminology: The GTP User Plane goes through -kernel accelerated path, while the GTP Control Plane goes to -Userspace :) - -The official homepage of the module is at -https://osmocom.org/projects/linux-kernel-gtp-u/wiki - -== Userspace Programs with Linux Kernel GTP-U support == - -At the time of this writing, there are at least two Free Software -implementations that implement GTP-C and can use the netlink interface -to make use of the Linux kernel GTP-U support: - -* OpenGGSN (classic 2G/3G GGSN in C): - https://osmocom.org/projects/openggsn/wiki/OpenGGSN - -* ergw (GGSN + P-GW in Erlang): - https://github.com/travelping/ergw - -== Userspace Library / Command Line Utilities == - -There is a userspace library called 'libgtpnl' which is based on -libmnl and which implements a C-language API towards the netlink -interface provided by the Kernel GTP module: - -http://git.osmocom.org/libgtpnl/ - -== Protocol Versions == - -There are two different versions of GTP-U: v0 [GSM TS 09.60] and v1 -[3GPP TS 29.281]. Both are implemented in the Kernel GTP module. -Version 0 is a legacy version, and deprecated from recent 3GPP -specifications. - -GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2151 -for GTPv1-U and 3386 for GTPv0-U. - -There are three versions of GTP-C: v0, v1, and v2. As the kernel -doesn't implement GTP-C, we don't have to worry about this. It's the -responsibility of the control plane implementation in userspace to -implement that. - -== IPv6 == - -The 3GPP specifications indicate either IPv4 or IPv6 can be used both -on the inner (user) IP layer, or on the outer (transport) layer. - -Unfortunately, the Kernel module currently supports IPv6 neither for -the User IP payload, nor for the outer IP layer. Patches or other -Contributions to fix this are most welcome! - -== Mailing List == - -If yo have questions regarding how to use the Kernel GTP module from -your own software, or want to contribute to the code, please use the -osmocom-net-grps mailing list for related discussion. The list can be -reached at osmocom-net-gprs@lists.osmocom.org and the mailman -interface for managing your subscription is at -https://lists.osmocom.org/mailman/listinfo/osmocom-net-gprs - -== Issue Tracker == - -The Osmocom project maintains an issue tracker for the Kernel GTP-U -module at -https://osmocom.org/projects/linux-kernel-gtp-u/issues - -== History / Acknowledgements == - -The Module was originally created in 2012 by Harald Welte, but never -completed. Pablo came in to finish the mess Harald left behind. But -doe to a lack of user interest, it never got merged. - -In 2015, Andreas Schultz came to the rescue and fixed lots more bugs, -extended it with new features and finally pushed all of us to get it -mainline, where it was merged in 4.7.0. - -== Architectural Details == - -=== Local GTP-U entity and tunnel identification === - -GTP-U uses UDP for transporting PDU's. The receiving UDP port is 2152 -for GTPv1-U and 3386 for GTPv0-U. - -There is only one GTP-U entity (and therefor SGSN/GGSN/S-GW/PDN-GW -instance) per IP address. Tunnel Endpoint Identifier (TEID) are unique -per GTP-U entity. - -A specific tunnel is only defined by the destination entity. Since the -destination port is constant, only the destination IP and TEID define -a tunnel. The source IP and Port have no meaning for the tunnel. - -Therefore: - - * when sending, the remote entity is defined by the remote IP and - the tunnel endpoint id. The source IP and port have no meaning and - can be changed at any time. - - * when receiving the local entity is defined by the local - destination IP and the tunnel endpoint id. The source IP and port - have no meaning and can change at any time. - -[3GPP TS 29.281] Section 4.3.0 defines this so: - -> The TEID in the GTP-U header is used to de-multiplex traffic -> incoming from remote tunnel endpoints so that it is delivered to the -> User plane entities in a way that allows multiplexing of different -> users, different packet protocols and different QoS levels. -> Therefore no two remote GTP-U endpoints shall send traffic to a -> GTP-U protocol entity using the same TEID value except -> for data forwarding as part of mobility procedures. - -The definition above only defines that two remote GTP-U endpoints -*should not* send to the same TEID, it *does not* forbid or exclude -such a scenario. In fact, the mentioned mobility procedures make it -necessary that the GTP-U entity accepts traffic for TEIDs from -multiple or unknown peers. - -Therefore, the receiving side identifies tunnels exclusively based on -TEIDs, not based on the source IP! - -== APN vs. Network Device == - -The GTP-U driver creates a Linux network device for each Gi/SGi -interface. - -[3GPP TS 29.281] calls the Gi/SGi reference point an interface. This -may lead to the impression that the GGSN/P-GW can have only one such -interface. - -Correct is that the Gi/SGi reference point defines the interworking -between +the 3GPP packet domain (PDN) based on GTP-U tunnel and IP -based networks. - -There is no provision in any of the 3GPP documents that limits the -number of Gi/SGi interfaces implemented by a GGSN/P-GW. - -[3GPP TS 29.061] Section 11.3 makes it clear that the selection of a -specific Gi/SGi interfaces is made through the Access Point Name -(APN): - -> 2. each private network manages its own addressing. In general this -> will result in different private networks having overlapping -> address ranges. A logically separate connection (e.g. an IP in IP -> tunnel or layer 2 virtual circuit) is used between the GGSN/P-GW -> and each private network. -> -> In this case the IP address alone is not necessarily unique. The -> pair of values, Access Point Name (APN) and IPv4 address and/or -> IPv6 prefixes, is unique. - -In order to support the overlapping address range use case, each APN -is mapped to a separate Gi/SGi interface (network device). - -NOTE: The Access Point Name is purely a control plane (GTP-C) concept. -At the GTP-U level, only Tunnel Endpoint Identifiers are present in -GTP-U packets and network devices are known - -Therefore for a given UE the mapping in IP to PDN network is: - * network device + MS IP -> Peer IP + Peer TEID, - -and from PDN to IP network: - * local GTP-U IP + TEID -> network device - -Furthermore, before a received T-PDU is injected into the network -device the MS IP is checked against the IP recorded in PDP context. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 33afbb67f3fa..b29a08d1f941 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -62,6 +62,7 @@ Contents: generic-hdlc generic_netlink gen_stats + gtp .. only:: subproject and html -- cgit v1.2.3 From 3c3a2fde4d88bb3d6c0592b4b7754f26dab9f697 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:43 +0200 Subject: docs: networking: convert hinic.txt to ReST Not much to be done here: - add SPDX header; - adjust titles and chapters, adding proper markups; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/hinic.rst | 128 +++++++++++++++++++++++++++++++++++++ Documentation/networking/hinic.txt | 125 ------------------------------------ Documentation/networking/index.rst | 1 + MAINTAINERS | 2 +- 4 files changed, 130 insertions(+), 126 deletions(-) create mode 100644 Documentation/networking/hinic.rst delete mode 100644 Documentation/networking/hinic.txt (limited to 'Documentation') diff --git a/Documentation/networking/hinic.rst b/Documentation/networking/hinic.rst new file mode 100644 index 000000000000..867ac8f4e04a --- /dev/null +++ b/Documentation/networking/hinic.rst @@ -0,0 +1,128 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================ +Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family +============================================================ + +Overview: +========= +HiNIC is a network interface card for the Data Center Area. + +The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.). +The driver supports also a negotiated and extendable feature set. + +Some HiNIC devices support SR-IOV. This driver is used for Physical Function +(PF). + +HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and +adaptive interrupt moderation. + +HiNIC devices support also various offload features such as checksum offload, +TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and +LRO(Large Receive Offload). + + +Supported PCI vendor ID/device IDs: +=================================== + +19e5:1822 - HiNIC PF + + +Driver Architecture and Source Code: +==================================== + +hinic_dev - Implement a Logical Network device that is independent from +specific HW details about HW data structure formats. + +hinic_hwdev - Implement the HW details of the device and include the components +for accessing the PCI NIC. + +hinic_hwdev contains the following components: +=============================================== + +HW Interface: +============= + +The interface for accessing the pci device (DMA memory and PCI BARs). +(hinic_hw_if.c, hinic_hw_if.h) + +Configuration Status Registers Area that describes the HW Registers on the +configuration and status BAR0. (hinic_hw_csr.h) + +MGMT components: +================ + +Asynchronous Event Queues(AEQs) - The event queues for receiving messages from +the MGMT modules on the cards. (hinic_hw_eqs.c, hinic_hw_eqs.h) + +Application Programmable Interface commands(API CMD) - Interface for sending +MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h) + +Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT +commands to the card and receives notifications from the MGMT modules on the +card by AEQs. Also set the addresses of the IO CMDQs in HW. +(hinic_hw_mgmt.c, hinic_hw_mgmt.h) + +IO components: +============== + +Completion Event Queues(CEQs) - The completion Event Queues that describe IO +tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h) + +Work Queues(WQ) - Contain the memory and operations for use by CMD queues and +the Queue Pairs. The WQ is a Memory Block in a Page. The Block contains +pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs). +(hinic_hw_wq.c, hinic_hw_wq.h) + +Command Queues(CMDQ) - The queues for sending commands for IO management and is +used to set the QPs addresses in HW. The commands completion events are +accumulated on the CEQ that is configured to receive the CMDQ completion events. +(hinic_hw_cmdq.c, hinic_hw_cmdq.h) + +Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting +Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h) + +IO - de/constructs all the IO components. (hinic_hw_io.c, hinic_hw_io.h) + +HW device: +========== + +HW device - de/constructs the HW Interface, the MGMT components on the +initialization of the driver and the IO components on the case of Interface +UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h) + + +hinic_dev contains the following components: +=============================================== + +PCI ID table - Contains the supported PCI Vendor/Device IDs. +(hinic_pci_tbl.h) + +Port Commands - Send commands to the HW device for port management +(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h) + +Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit. +The Logical Tx queue is not dependent on the format of the HW Send Queue. +(hinic_tx.c, hinic_tx.h) + +Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive. +The Logical Rx queue is not dependent on the format of the HW Receive Queue. +(hinic_rx.c, hinic_rx.h) + +hinic_dev - de/constructs the Logical Tx and Rx Queues. +(hinic_main.c, hinic_dev.h) + + +Miscellaneous +============= + +Common functions that are used by HW and Logical Device. +(hinic_common.c, hinic_common.h) + + +Support +======= + +If an issue is identified with the released source code on the supported kernel +with a supported adapter, email the specific information related to the issue to +aviad.krawczyk@huawei.com. diff --git a/Documentation/networking/hinic.txt b/Documentation/networking/hinic.txt deleted file mode 100644 index 989366a4039c..000000000000 --- a/Documentation/networking/hinic.txt +++ /dev/null @@ -1,125 +0,0 @@ -Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family -============================================================ - -Overview: -========= -HiNIC is a network interface card for the Data Center Area. - -The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.). -The driver supports also a negotiated and extendable feature set. - -Some HiNIC devices support SR-IOV. This driver is used for Physical Function -(PF). - -HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and -adaptive interrupt moderation. - -HiNIC devices support also various offload features such as checksum offload, -TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and -LRO(Large Receive Offload). - - -Supported PCI vendor ID/device IDs: -=================================== - -19e5:1822 - HiNIC PF - - -Driver Architecture and Source Code: -==================================== - -hinic_dev - Implement a Logical Network device that is independent from -specific HW details about HW data structure formats. - -hinic_hwdev - Implement the HW details of the device and include the components -for accessing the PCI NIC. - -hinic_hwdev contains the following components: -=============================================== - -HW Interface: -============= - -The interface for accessing the pci device (DMA memory and PCI BARs). -(hinic_hw_if.c, hinic_hw_if.h) - -Configuration Status Registers Area that describes the HW Registers on the -configuration and status BAR0. (hinic_hw_csr.h) - -MGMT components: -================ - -Asynchronous Event Queues(AEQs) - The event queues for receiving messages from -the MGMT modules on the cards. (hinic_hw_eqs.c, hinic_hw_eqs.h) - -Application Programmable Interface commands(API CMD) - Interface for sending -MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h) - -Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT -commands to the card and receives notifications from the MGMT modules on the -card by AEQs. Also set the addresses of the IO CMDQs in HW. -(hinic_hw_mgmt.c, hinic_hw_mgmt.h) - -IO components: -============== - -Completion Event Queues(CEQs) - The completion Event Queues that describe IO -tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h) - -Work Queues(WQ) - Contain the memory and operations for use by CMD queues and -the Queue Pairs. The WQ is a Memory Block in a Page. The Block contains -pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs). -(hinic_hw_wq.c, hinic_hw_wq.h) - -Command Queues(CMDQ) - The queues for sending commands for IO management and is -used to set the QPs addresses in HW. The commands completion events are -accumulated on the CEQ that is configured to receive the CMDQ completion events. -(hinic_hw_cmdq.c, hinic_hw_cmdq.h) - -Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting -Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h) - -IO - de/constructs all the IO components. (hinic_hw_io.c, hinic_hw_io.h) - -HW device: -========== - -HW device - de/constructs the HW Interface, the MGMT components on the -initialization of the driver and the IO components on the case of Interface -UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h) - - -hinic_dev contains the following components: -=============================================== - -PCI ID table - Contains the supported PCI Vendor/Device IDs. -(hinic_pci_tbl.h) - -Port Commands - Send commands to the HW device for port management -(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h) - -Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit. -The Logical Tx queue is not dependent on the format of the HW Send Queue. -(hinic_tx.c, hinic_tx.h) - -Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive. -The Logical Rx queue is not dependent on the format of the HW Receive Queue. -(hinic_rx.c, hinic_rx.h) - -hinic_dev - de/constructs the Logical Tx and Rx Queues. -(hinic_main.c, hinic_dev.h) - - -Miscellaneous: -============= - -Common functions that are used by HW and Logical Device. -(hinic_common.c, hinic_common.h) - - -Support -======= - -If an issue is identified with the released source code on the supported kernel -with a supported adapter, email the specific information related to the issue to -aviad.krawczyk@huawei.com. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b29a08d1f941..5a7889df1375 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -63,6 +63,7 @@ Contents: generic_netlink gen_stats gtp + hinic .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index 4ec6d2741d36..df5e4ccc1ccb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7815,7 +7815,7 @@ HUAWEI ETHERNET DRIVER M: Aviad Krawczyk L: netdev@vger.kernel.org S: Supported -F: Documentation/networking/hinic.txt +F: Documentation/networking/hinic.rst F: drivers/net/ethernet/huawei/hinic/ HUGETLB FILESYSTEM -- cgit v1.2.3 From 1d2698fa05f57ba2900e1ff50ac33ec85d2087d3 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:44 +0200 Subject: docs: networking: convert ila.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/ila.rst | 296 +++++++++++++++++++++++++++++++++++++ Documentation/networking/ila.txt | 285 ----------------------------------- Documentation/networking/index.rst | 1 + 3 files changed, 297 insertions(+), 285 deletions(-) create mode 100644 Documentation/networking/ila.rst delete mode 100644 Documentation/networking/ila.txt (limited to 'Documentation') diff --git a/Documentation/networking/ila.rst b/Documentation/networking/ila.rst new file mode 100644 index 000000000000..5ac0a6270b17 --- /dev/null +++ b/Documentation/networking/ila.rst @@ -0,0 +1,296 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Identifier Locator Addressing (ILA) +=================================== + + +Introduction +============ + +Identifier-locator addressing (ILA) is a technique used with IPv6 that +differentiates between location and identity of a network node. Part of an +address expresses the immutable identity of the node, and another part +indicates the location of the node which can be dynamic. Identifier-locator +addressing can be used to efficiently implement overlay networks for +network virtualization as well as solutions for use cases in mobility. + +ILA can be thought of as means to implement an overlay network without +encapsulation. This is accomplished by performing network address +translation on destination addresses as a packet traverses a network. To +the network, an ILA translated packet appears to be no different than any +other IPv6 packet. For instance, if the transport protocol is TCP then an +ILA translated packet looks like just another TCP/IPv6 packet. The +advantage of this is that ILA is transparent to the network so that +optimizations in the network, such as ECMP, RSS, GRO, GSO, etc., just work. + +The ILA protocol is described in Internet-Draft draft-herbert-intarea-ila. + + +ILA terminology +=============== + + - Identifier + A number that identifies an addressable node in the network + independent of its location. ILA identifiers are sixty-four + bit values. + + - Locator + A network prefix that routes to a physical host. Locators + provide the topological location of an addressed node. ILA + locators are sixty-four bit prefixes. + + - ILA mapping + A mapping of an ILA identifier to a locator (or to a + locator and meta data). An ILA domain maintains a database + that contains mappings for all destinations in the domain. + + - SIR address + An IPv6 address composed of a SIR prefix (upper sixty- + four bits) and an identifier (lower sixty-four bits). + SIR addresses are visible to applications and provide a + means for them to address nodes independent of their + location. + + - ILA address + An IPv6 address composed of a locator (upper sixty-four + bits) and an identifier (low order sixty-four bits). ILA + addresses are never visible to an application. + + - ILA host + An end host that is capable of performing ILA translations + on transmit or receive. + + - ILA router + A network node that performs ILA translation and forwarding + of translated packets. + + - ILA forwarding cache + A type of ILA router that only maintains a working set + cache of mappings. + + - ILA node + A network node capable of performing ILA translations. This + can be an ILA router, ILA forwarding cache, or ILA host. + + +Operation +========= + +There are two fundamental operations with ILA: + + - Translate a SIR address to an ILA address. This is performed on ingress + to an ILA overlay. + + - Translate an ILA address to a SIR address. This is performed on egress + from the ILA overlay. + +ILA can be deployed either on end hosts or intermediate devices in the +network; these are provided by "ILA hosts" and "ILA routers" respectively. +Configuration and datapath for these two points of deployment is somewhat +different. + +The diagram below illustrates the flow of packets through ILA as well +as showing ILA hosts and routers:: + + +--------+ +--------+ + | Host A +-+ +--->| Host B | + | | | (2) ILA (') | | + +--------+ | ...addressed.... ( ) +--------+ + V +---+--+ . packet . +---+--+ (_) + (1) SIR | | ILA |----->-------->---->| ILA | | (3) SIR + addressed +->|router| . . |router|->-+ addressed + packet +---+--+ . IPv6 . +---+--+ packet + / . Network . + / . . +--+-++--------+ + +--------+ / . . |ILA || Host | + | Host +--+ . .- -|host|| | + | | . . +--+-++--------+ + +--------+ ................ + + +Transport checksum handling +=========================== + +When an address is translated by ILA, an encapsulated transport checksum +that includes the translated address in a pseudo header may be rendered +incorrect on the wire. This is a problem for intermediate devices, +including checksum offload in NICs, that process the checksum. There are +three options to deal with this: + +- no action Allow the checksum to be incorrect on the wire. Before + a receiver verifies a checksum the ILA to SIR address + translation must be done. + +- adjust transport checksum + When ILA translation is performed the packet is parsed + and if a transport layer checksum is found then it is + adjusted to reflect the correct checksum per the + translated address. + +- checksum neutral mapping + When an address is translated the difference can be offset + elsewhere in a part of the packet that is covered by + the checksum. The low order sixteen bits of the identifier + are used. This method is preferred since it doesn't require + parsing a packet beyond the IP header and in most cases the + adjustment can be precomputed and saved with the mapping. + +Note that the checksum neutral adjustment affects the low order sixteen +bits of the identifier. When ILA to SIR address translation is done on +egress the low order bits are restored to the original value which +restores the identifier as it was originally sent. + + +Identifier types +================ + +ILA defines different types of identifiers for different use cases. + +The defined types are: + + 0: interface identifier + + 1: locally unique identifier + + 2: virtual networking identifier for IPv4 address + + 3: virtual networking identifier for IPv6 unicast address + + 4: virtual networking identifier for IPv6 multicast address + + 5: non-local address identifier + +In the current implementation of kernel ILA only locally unique identifiers +(LUID) are supported. LUID allows for a generic, unformatted 64 bit +identifier. + + +Identifier formats +================== + +Kernel ILA supports two optional fields in an identifier for formatting: +"C-bit" and "identifier type". The presence of these fields is determined +by configuration as demonstrated below. + +If the identifier type is present it occupies the three highest order +bits of an identifier. The possible values are given in the above list. + +If the C-bit is present, this is used as an indication that checksum +neutral mapping has been done. The C-bit can only be set in an +ILA address, never a SIR address. + +In the simplest format the identifier types, C-bit, and checksum +adjustment value are not present so an identifier is considered an +unstructured sixty-four bit value:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Identifier | + + + + | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +The checksum neutral adjustment may be configured to always be +present using neutral-map-auto. In this case there is no C-bit, but the +checksum adjustment is in the low order 16 bits. The identifier is +still sixty-four bits:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Identifier | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +The C-bit may used to explicitly indicate that checksum neutral +mapping has been applied to an ILA address. The format is:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | |C| Identifier | + | +-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +The identifier type field may be present to indicate the identifier +type. If it is not present then the type is inferred based on mapping +configuration. The checksum neutral adjustment may automatically +used with the identifier type as illustrated below:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Type| Identifier | + +-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +If the identifier type and the C-bit can be present simultaneously so +the identifier format would be:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Type|C| Identifier | + +-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + +Configuration +============= + +There are two methods to configure ILA mappings. One is by using LWT routes +and the other is ila_xlat (called from NFHOOK PREROUTING hook). ila_xlat +is intended to be used in the receive path for ILA hosts . + +An ILA router has also been implemented in XDP. Description of that is +outside the scope of this document. + +The usage of for ILA LWT routes is: + +ip route add DEST/128 encap ila LOC csum-mode MODE ident-type TYPE via ADDR + +Destination (DEST) can either be a SIR address (for an ILA host or ingress +ILA router) or an ILA address (egress ILA router). LOC is the sixty-four +bit locator (with format W:X:Y:Z) that overwrites the upper sixty-four +bits of the destination address. Checksum MODE is one of "no-action", +"adj-transport", "neutral-map", and "neutral-map-auto". If neutral-map is +set then the C-bit will be present. Identifier TYPE one of "luid" or +"use-format." In the case of use-format, the identifier type field is +present and the effective type is taken from that. + +The usage of ila_xlat is: + +ip ila add loc_match MATCH loc LOC csum-mode MODE ident-type TYPE + +MATCH indicates the incoming locator that must be matched to apply +a the translaiton. LOC is the locator that overwrites the upper +sixty-four bits of the destination address. MODE and TYPE have the +same meanings as described above. + + +Some examples +============= + +:: + + # Configure an ILA route that uses checksum neutral mapping as well + # as type field. Note that the type field is set in the SIR address + # (the 2000 implies type is 1 which is LUID). + ip route add 3333:0:0:1:2000:0:1:87/128 encap ila 2001:0:87:0 \ + csum-mode neutral-map ident-type use-format + + # Configure an ILA LWT route that uses auto checksum neutral mapping + # (no C-bit) and configure identifier type to be LUID so that the + # identifier type field will not be present. + ip route add 3333:0:0:1:2000:0:2:87/128 encap ila 2001:0:87:1 \ + csum-mode neutral-map-auto ident-type luid + + ila_xlat configuration + + # Configure an ILA to SIR mapping that matches a locator and overwrites + # it with a SIR address (3333:0:0:1 in this example). The C-bit and + # identifier field are used. + ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \ + csum-mode neutral-map-auto ident-type use-format + + # Configure an ILA to SIR mapping where checksum neutral is automatically + # set without the C-bit and the identifier type is configured to be LUID + # so that the identifier type field is not present. + ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \ + csum-mode neutral-map-auto ident-type use-format diff --git a/Documentation/networking/ila.txt b/Documentation/networking/ila.txt deleted file mode 100644 index a17dac9dc915..000000000000 --- a/Documentation/networking/ila.txt +++ /dev/null @@ -1,285 +0,0 @@ -Identifier Locator Addressing (ILA) - - -Introduction -============ - -Identifier-locator addressing (ILA) is a technique used with IPv6 that -differentiates between location and identity of a network node. Part of an -address expresses the immutable identity of the node, and another part -indicates the location of the node which can be dynamic. Identifier-locator -addressing can be used to efficiently implement overlay networks for -network virtualization as well as solutions for use cases in mobility. - -ILA can be thought of as means to implement an overlay network without -encapsulation. This is accomplished by performing network address -translation on destination addresses as a packet traverses a network. To -the network, an ILA translated packet appears to be no different than any -other IPv6 packet. For instance, if the transport protocol is TCP then an -ILA translated packet looks like just another TCP/IPv6 packet. The -advantage of this is that ILA is transparent to the network so that -optimizations in the network, such as ECMP, RSS, GRO, GSO, etc., just work. - -The ILA protocol is described in Internet-Draft draft-herbert-intarea-ila. - - -ILA terminology -=============== - - - Identifier A number that identifies an addressable node in the network - independent of its location. ILA identifiers are sixty-four - bit values. - - - Locator A network prefix that routes to a physical host. Locators - provide the topological location of an addressed node. ILA - locators are sixty-four bit prefixes. - - - ILA mapping - A mapping of an ILA identifier to a locator (or to a - locator and meta data). An ILA domain maintains a database - that contains mappings for all destinations in the domain. - - - SIR address - An IPv6 address composed of a SIR prefix (upper sixty- - four bits) and an identifier (lower sixty-four bits). - SIR addresses are visible to applications and provide a - means for them to address nodes independent of their - location. - - - ILA address - An IPv6 address composed of a locator (upper sixty-four - bits) and an identifier (low order sixty-four bits). ILA - addresses are never visible to an application. - - - ILA host An end host that is capable of performing ILA translations - on transmit or receive. - - - ILA router A network node that performs ILA translation and forwarding - of translated packets. - - - ILA forwarding cache - A type of ILA router that only maintains a working set - cache of mappings. - - - ILA node A network node capable of performing ILA translations. This - can be an ILA router, ILA forwarding cache, or ILA host. - - -Operation -========= - -There are two fundamental operations with ILA: - - - Translate a SIR address to an ILA address. This is performed on ingress - to an ILA overlay. - - - Translate an ILA address to a SIR address. This is performed on egress - from the ILA overlay. - -ILA can be deployed either on end hosts or intermediate devices in the -network; these are provided by "ILA hosts" and "ILA routers" respectively. -Configuration and datapath for these two points of deployment is somewhat -different. - -The diagram below illustrates the flow of packets through ILA as well -as showing ILA hosts and routers. - - +--------+ +--------+ - | Host A +-+ +--->| Host B | - | | | (2) ILA (') | | - +--------+ | ...addressed.... ( ) +--------+ - V +---+--+ . packet . +---+--+ (_) - (1) SIR | | ILA |----->-------->---->| ILA | | (3) SIR - addressed +->|router| . . |router|->-+ addressed - packet +---+--+ . IPv6 . +---+--+ packet - / . Network . - / . . +--+-++--------+ - +--------+ / . . |ILA || Host | - | Host +--+ . .- -|host|| | - | | . . +--+-++--------+ - +--------+ ................ - - -Transport checksum handling -=========================== - -When an address is translated by ILA, an encapsulated transport checksum -that includes the translated address in a pseudo header may be rendered -incorrect on the wire. This is a problem for intermediate devices, -including checksum offload in NICs, that process the checksum. There are -three options to deal with this: - -- no action Allow the checksum to be incorrect on the wire. Before - a receiver verifies a checksum the ILA to SIR address - translation must be done. - -- adjust transport checksum - When ILA translation is performed the packet is parsed - and if a transport layer checksum is found then it is - adjusted to reflect the correct checksum per the - translated address. - -- checksum neutral mapping - When an address is translated the difference can be offset - elsewhere in a part of the packet that is covered by - the checksum. The low order sixteen bits of the identifier - are used. This method is preferred since it doesn't require - parsing a packet beyond the IP header and in most cases the - adjustment can be precomputed and saved with the mapping. - -Note that the checksum neutral adjustment affects the low order sixteen -bits of the identifier. When ILA to SIR address translation is done on -egress the low order bits are restored to the original value which -restores the identifier as it was originally sent. - - -Identifier types -================ - -ILA defines different types of identifiers for different use cases. - -The defined types are: - - 0: interface identifier - - 1: locally unique identifier - - 2: virtual networking identifier for IPv4 address - - 3: virtual networking identifier for IPv6 unicast address - - 4: virtual networking identifier for IPv6 multicast address - - 5: non-local address identifier - -In the current implementation of kernel ILA only locally unique identifiers -(LUID) are supported. LUID allows for a generic, unformatted 64 bit -identifier. - - -Identifier formats -================== - -Kernel ILA supports two optional fields in an identifier for formatting: -"C-bit" and "identifier type". The presence of these fields is determined -by configuration as demonstrated below. - -If the identifier type is present it occupies the three highest order -bits of an identifier. The possible values are given in the above list. - -If the C-bit is present, this is used as an indication that checksum -neutral mapping has been done. The C-bit can only be set in an -ILA address, never a SIR address. - -In the simplest format the identifier types, C-bit, and checksum -adjustment value are not present so an identifier is considered an -unstructured sixty-four bit value. - - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | Identifier | - + + - | | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - -The checksum neutral adjustment may be configured to always be -present using neutral-map-auto. In this case there is no C-bit, but the -checksum adjustment is in the low order 16 bits. The identifier is -still sixty-four bits. - - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | Identifier | - | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | | Checksum-neutral adjustment | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - -The C-bit may used to explicitly indicate that checksum neutral -mapping has been applied to an ILA address. The format is: - - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | |C| Identifier | - | +-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | | Checksum-neutral adjustment | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - -The identifier type field may be present to indicate the identifier -type. If it is not present then the type is inferred based on mapping -configuration. The checksum neutral adjustment may automatically -used with the identifier type as illustrated below. - - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | Type| Identifier | - +-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | | Checksum-neutral adjustment | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - -If the identifier type and the C-bit can be present simultaneously so -the identifier format would be: - - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | Type|C| Identifier | - +-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - | | Checksum-neutral adjustment | - +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ - - -Configuration -============= - -There are two methods to configure ILA mappings. One is by using LWT routes -and the other is ila_xlat (called from NFHOOK PREROUTING hook). ila_xlat -is intended to be used in the receive path for ILA hosts . - -An ILA router has also been implemented in XDP. Description of that is -outside the scope of this document. - -The usage of for ILA LWT routes is: - -ip route add DEST/128 encap ila LOC csum-mode MODE ident-type TYPE via ADDR - -Destination (DEST) can either be a SIR address (for an ILA host or ingress -ILA router) or an ILA address (egress ILA router). LOC is the sixty-four -bit locator (with format W:X:Y:Z) that overwrites the upper sixty-four -bits of the destination address. Checksum MODE is one of "no-action", -"adj-transport", "neutral-map", and "neutral-map-auto". If neutral-map is -set then the C-bit will be present. Identifier TYPE one of "luid" or -"use-format." In the case of use-format, the identifier type field is -present and the effective type is taken from that. - -The usage of ila_xlat is: - -ip ila add loc_match MATCH loc LOC csum-mode MODE ident-type TYPE - -MATCH indicates the incoming locator that must be matched to apply -a the translaiton. LOC is the locator that overwrites the upper -sixty-four bits of the destination address. MODE and TYPE have the -same meanings as described above. - - -Some examples -============= - -# Configure an ILA route that uses checksum neutral mapping as well -# as type field. Note that the type field is set in the SIR address -# (the 2000 implies type is 1 which is LUID). -ip route add 3333:0:0:1:2000:0:1:87/128 encap ila 2001:0:87:0 \ - csum-mode neutral-map ident-type use-format - -# Configure an ILA LWT route that uses auto checksum neutral mapping -# (no C-bit) and configure identifier type to be LUID so that the -# identifier type field will not be present. -ip route add 3333:0:0:1:2000:0:2:87/128 encap ila 2001:0:87:1 \ - csum-mode neutral-map-auto ident-type luid - -ila_xlat configuration - -# Configure an ILA to SIR mapping that matches a locator and overwrites -# it with a SIR address (3333:0:0:1 in this example). The C-bit and -# identifier field are used. -ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \ - csum-mode neutral-map-auto ident-type use-format - -# Configure an ILA to SIR mapping where checksum neutral is automatically -# set without the C-bit and the identifier type is configured to be LUID -# so that the identifier type field is not present. -ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \ - csum-mode neutral-map-auto ident-type use-format diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 5a7889df1375..488971f6b650 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -64,6 +64,7 @@ Contents: gen_stats gtp hinic + ila .. only:: subproject and html -- cgit v1.2.3 From 7cdb25400f7e8624414260d1b0fa70da280b2303 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:45 +0200 Subject: docs: networking: convert ipddp.txt to ReST Not much to be done here: - add SPDX header; - use a document title from existing text; - adjust a chapter markup; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ipddp.rst | 78 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/ipddp.txt | 73 ----------------------------------- Documentation/networking/ltpc.txt | 2 +- drivers/net/appletalk/Kconfig | 4 +- 5 files changed, 82 insertions(+), 76 deletions(-) create mode 100644 Documentation/networking/ipddp.rst delete mode 100644 Documentation/networking/ipddp.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 488971f6b650..cf85d0a73144 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -65,6 +65,7 @@ Contents: gtp hinic ila + ipddp .. only:: subproject and html diff --git a/Documentation/networking/ipddp.rst b/Documentation/networking/ipddp.rst new file mode 100644 index 000000000000..be7091b77927 --- /dev/null +++ b/Documentation/networking/ipddp.rst @@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================================= +AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation +========================================================= + +Documentation ipddp.c + +This file is written by Jay Schulist + +Introduction +------------ + +AppleTalk-IP (IPDDP) is the method computers connected to AppleTalk +networks can use to communicate via IP. AppleTalk-IP is simply IP datagrams +inside AppleTalk packets. + +Through this driver you can either allow your Linux box to communicate +IP over an AppleTalk network or you can provide IP gatewaying functions +for your AppleTalk users. + +You can currently encapsulate or decapsulate AppleTalk-IP on LocalTalk, +EtherTalk and PPPTalk. The only limit on the protocol is that of what +kernel AppleTalk layer and drivers are available. + +Each mode requires its own user space software. + +Compiling AppleTalk-IP Decapsulation/Encapsulation +================================================== + +AppleTalk-IP decapsulation needs to be compiled into your kernel. You +will need to turn on AppleTalk-IP driver support. Then you will need to +select ONE of the two options; IP to AppleTalk-IP encapsulation support or +AppleTalk-IP to IP decapsulation support. If you compile the driver +statically you will only be able to use the driver for the function you have +enabled in the kernel. If you compile the driver as a module you can +select what mode you want it to run in via a module loading param. +ipddp_mode=1 for AppleTalk-IP encapsulation and ipddp_mode=2 for +AppleTalk-IP to IP decapsulation. + +Basic instructions for user space tools +======================================= + +I will briefly describe the operation of the tools, but you will +need to consult the supporting documentation for each set of tools. + +Decapsulation - You will need to download a software package called +MacGate. In this distribution there will be a tool called MacRoute +which enables you to add routes to the kernel for your Macs by hand. +Also the tool MacRegGateWay is included to register the +proper IP Gateway and IP addresses for your machine. Included in this +distribution is a patch to netatalk-1.4b2+asun2.0a17.2 (available from +ftp.u.washington.edu/pub/user-supported/asun/) this patch is optional +but it allows automatic adding and deleting of routes for Macs. (Handy +for locations with large Mac installations) + +Encapsulation - You will need to download a software daemon called ipddpd. +This software expects there to be an AppleTalk-IP gateway on the network. +You will also need to add the proper routes to route your Linux box's IP +traffic out the ipddp interface. + +Common Uses of ipddp.c +---------------------- +Of course AppleTalk-IP decapsulation and encapsulation, but specifically +decapsulation is being used most for connecting LocalTalk networks to +IP networks. Although it has been used on EtherTalk networks to allow +Macs that are only able to tunnel IP over EtherTalk. + +Encapsulation has been used to allow a Linux box stuck on a LocalTalk +network to use IP. It should work equally well if you are stuck on an +EtherTalk only network. + +Further Assistance +------------------- +You can contact me (Jay Schulist ) with any +questions regarding decapsulation or encapsulation. Bradford W. Johnson + originally wrote the ipddp.c driver for IP +encapsulation in AppleTalk. diff --git a/Documentation/networking/ipddp.txt b/Documentation/networking/ipddp.txt deleted file mode 100644 index ba5c217fffe0..000000000000 --- a/Documentation/networking/ipddp.txt +++ /dev/null @@ -1,73 +0,0 @@ -Text file for ipddp.c: - AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation - -This text file is written by Jay Schulist - -Introduction ------------- - -AppleTalk-IP (IPDDP) is the method computers connected to AppleTalk -networks can use to communicate via IP. AppleTalk-IP is simply IP datagrams -inside AppleTalk packets. - -Through this driver you can either allow your Linux box to communicate -IP over an AppleTalk network or you can provide IP gatewaying functions -for your AppleTalk users. - -You can currently encapsulate or decapsulate AppleTalk-IP on LocalTalk, -EtherTalk and PPPTalk. The only limit on the protocol is that of what -kernel AppleTalk layer and drivers are available. - -Each mode requires its own user space software. - -Compiling AppleTalk-IP Decapsulation/Encapsulation -================================================= - -AppleTalk-IP decapsulation needs to be compiled into your kernel. You -will need to turn on AppleTalk-IP driver support. Then you will need to -select ONE of the two options; IP to AppleTalk-IP encapsulation support or -AppleTalk-IP to IP decapsulation support. If you compile the driver -statically you will only be able to use the driver for the function you have -enabled in the kernel. If you compile the driver as a module you can -select what mode you want it to run in via a module loading param. -ipddp_mode=1 for AppleTalk-IP encapsulation and ipddp_mode=2 for -AppleTalk-IP to IP decapsulation. - -Basic instructions for user space tools -======================================= - -I will briefly describe the operation of the tools, but you will -need to consult the supporting documentation for each set of tools. - -Decapsulation - You will need to download a software package called -MacGate. In this distribution there will be a tool called MacRoute -which enables you to add routes to the kernel for your Macs by hand. -Also the tool MacRegGateWay is included to register the -proper IP Gateway and IP addresses for your machine. Included in this -distribution is a patch to netatalk-1.4b2+asun2.0a17.2 (available from -ftp.u.washington.edu/pub/user-supported/asun/) this patch is optional -but it allows automatic adding and deleting of routes for Macs. (Handy -for locations with large Mac installations) - -Encapsulation - You will need to download a software daemon called ipddpd. -This software expects there to be an AppleTalk-IP gateway on the network. -You will also need to add the proper routes to route your Linux box's IP -traffic out the ipddp interface. - -Common Uses of ipddp.c ----------------------- -Of course AppleTalk-IP decapsulation and encapsulation, but specifically -decapsulation is being used most for connecting LocalTalk networks to -IP networks. Although it has been used on EtherTalk networks to allow -Macs that are only able to tunnel IP over EtherTalk. - -Encapsulation has been used to allow a Linux box stuck on a LocalTalk -network to use IP. It should work equally well if you are stuck on an -EtherTalk only network. - -Further Assistance -------------------- -You can contact me (Jay Schulist ) with any -questions regarding decapsulation or encapsulation. Bradford W. Johnson - originally wrote the ipddp.c driver for IP -encapsulation in AppleTalk. diff --git a/Documentation/networking/ltpc.txt b/Documentation/networking/ltpc.txt index 0bf3220c715b..a005a73b76d0 100644 --- a/Documentation/networking/ltpc.txt +++ b/Documentation/networking/ltpc.txt @@ -99,7 +99,7 @@ treat the LocalTalk device like an ordinary Ethernet device, even if that's what it looks like to Netatalk. Instead, you follow the same procedure as for doing IP in EtherTalk. -See Documentation/networking/ipddp.txt for more information about the +See Documentation/networking/ipddp.rst for more information about the kernel driver and userspace tools needed. -------------------------------------- diff --git a/drivers/net/appletalk/Kconfig b/drivers/net/appletalk/Kconfig index d4e51c048f62..ccde6479050c 100644 --- a/drivers/net/appletalk/Kconfig +++ b/drivers/net/appletalk/Kconfig @@ -86,7 +86,7 @@ config IPDDP box is stuck on an AppleTalk only network) or decapsulate (e.g. if you want your Linux box to act as an Internet gateway for a zoo of AppleTalk connected Macs). Please see the file - for more information. + for more information. If you say Y here, the AppleTalk-IP support will be compiled into the kernel. In this case, you can either use encapsulation or @@ -107,4 +107,4 @@ config IPDDP_ENCAP IP packets inside AppleTalk frames; this is useful if your Linux box is stuck on an AppleTalk network (which hopefully contains a decapsulator somewhere). Please see - for more information. + for more information. -- cgit v1.2.3 From 9de1fcdf36e7e00693a260865a5f2a58af1c7040 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:46 +0200 Subject: docs: networking: convert ip_dynaddr.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ip_dynaddr.rst | 40 +++++++++++++++++++++++++++++++++ Documentation/networking/ip_dynaddr.txt | 29 ------------------------ 3 files changed, 41 insertions(+), 29 deletions(-) create mode 100644 Documentation/networking/ip_dynaddr.rst delete mode 100644 Documentation/networking/ip_dynaddr.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index cf85d0a73144..f81aeb87aa28 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -66,6 +66,7 @@ Contents: hinic ila ipddp + ip_dynaddr .. only:: subproject and html diff --git a/Documentation/networking/ip_dynaddr.rst b/Documentation/networking/ip_dynaddr.rst new file mode 100644 index 000000000000..eacc0c780c7f --- /dev/null +++ b/Documentation/networking/ip_dynaddr.rst @@ -0,0 +1,40 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +IP dynamic address hack-port v0.03 +================================== + +This stuff allows diald ONESHOT connections to get established by +dynamically changing packet source address (and socket's if local procs). +It is implemented for TCP diald-box connections(1) and IP_MASQuerading(2). + +If enabled\ [#]_ and forwarding interface has changed: + + 1) Socket (and packet) source address is rewritten ON RETRANSMISSIONS + while in SYN_SENT state (diald-box processes). + 2) Out-bounded MASQueraded source address changes ON OUTPUT (when + internal host does retransmission) until a packet from outside is + received by the tunnel. + +This is specially helpful for auto dialup links (diald), where the +``actual`` outgoing address is unknown at the moment the link is +going up. So, the *same* (local AND masqueraded) connections requests that +bring the link up will be able to get established. + +.. [#] At boot, by default no address rewriting is attempted. + + To enable:: + + # echo 1 > /proc/sys/net/ipv4/ip_dynaddr + + To enable verbose mode:: + + # echo 2 > /proc/sys/net/ipv4/ip_dynaddr + + To disable (default):: + + # echo 0 > /proc/sys/net/ipv4/ip_dynaddr + +Enjoy! + +Juanjo diff --git a/Documentation/networking/ip_dynaddr.txt b/Documentation/networking/ip_dynaddr.txt deleted file mode 100644 index 45f3c1268e86..000000000000 --- a/Documentation/networking/ip_dynaddr.txt +++ /dev/null @@ -1,29 +0,0 @@ -IP dynamic address hack-port v0.03 -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -This stuff allows diald ONESHOT connections to get established by -dynamically changing packet source address (and socket's if local procs). -It is implemented for TCP diald-box connections(1) and IP_MASQuerading(2). - -If enabled[*] and forwarding interface has changed: - 1) Socket (and packet) source address is rewritten ON RETRANSMISSIONS - while in SYN_SENT state (diald-box processes). - 2) Out-bounded MASQueraded source address changes ON OUTPUT (when - internal host does retransmission) until a packet from outside is - received by the tunnel. - -This is specially helpful for auto dialup links (diald), where the -``actual'' outgoing address is unknown at the moment the link is -going up. So, the *same* (local AND masqueraded) connections requests that -bring the link up will be able to get established. - -[*] At boot, by default no address rewriting is attempted. - To enable: - # echo 1 > /proc/sys/net/ipv4/ip_dynaddr - To enable verbose mode: - # echo 2 > /proc/sys/net/ipv4/ip_dynaddr - To disable (default) - # echo 0 > /proc/sys/net/ipv4/ip_dynaddr - -Enjoy! - --- Juanjo -- cgit v1.2.3 From aac86c887ed66ac4f467821ebf75373124a148d7 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:47 +0200 Subject: docs: networking: convert iphase.txt to ReST - add SPDX header; - adjust title using the proper markup; - mark code blocks and literals as such; - mark tables as such; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/iphase.rst | 193 ++++++++++++++++++++++++++++++++++++ Documentation/networking/iphase.txt | 158 ----------------------------- drivers/atm/Kconfig | 2 +- 4 files changed, 195 insertions(+), 159 deletions(-) create mode 100644 Documentation/networking/iphase.rst delete mode 100644 Documentation/networking/iphase.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index f81aeb87aa28..505eaa41ca2b 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -67,6 +67,7 @@ Contents: ila ipddp ip_dynaddr + iphase .. only:: subproject and html diff --git a/Documentation/networking/iphase.rst b/Documentation/networking/iphase.rst new file mode 100644 index 000000000000..92d9b757d75a --- /dev/null +++ b/Documentation/networking/iphase.rst @@ -0,0 +1,193 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +ATM (i)Chip IA Linux Driver Source +================================== + + READ ME FISRT + +-------------------------------------------------------------------------------- + + Read This Before You Begin! + +-------------------------------------------------------------------------------- + +Description +=========== + +This is the README file for the Interphase PCI ATM (i)Chip IA Linux driver +source release. + +The features and limitations of this driver are as follows: + + - A single VPI (VPI value of 0) is supported. + - Supports 4K VCs for the server board (with 512K control memory) and 1K + VCs for the client board (with 128K control memory). + - UBR, ABR and CBR service categories are supported. + - Only AAL5 is supported. + - Supports setting of PCR on the VCs. + - Multiple adapters in a system are supported. + - All variants of Interphase ATM PCI (i)Chip adapter cards are supported, + including x575 (OC3, control memory 128K , 512K and packet memory 128K, + 512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See + http://www.iphase.com/ + for details. + - Only x86 platforms are supported. + - SMP is supported. + + +Before You Start +================ + + +Installation +------------ + +1. Installing the adapters in the system + + To install the ATM adapters in the system, follow the steps below. + + a. Login as root. + b. Shut down the system and power off the system. + c. Install one or more ATM adapters in the system. + d. Connect each adapter to a port on an ATM switch. The green 'Link' + LED on the front panel of the adapter will be on if the adapter is + connected to the switch properly when the system is powered up. + e. Power on and boot the system. + +2. [ Removed ] + +3. Rebuild kernel with ABR support + + [ a. and b. removed ] + + c. Reconfigure the kernel, choose the Interphase ia driver through "make + menuconfig" or "make xconfig". + d. Rebuild the kernel, loadable modules and the atm tools. + e. Install the new built kernel and modules and reboot. + +4. Load the adapter hardware driver (ia driver) if it is built as a module + + a. Login as root. + b. Change directory to /lib/modules//atm. + c. Run "insmod suni.o;insmod iphase.o" + The yellow 'status' LED on the front panel of the adapter will blink + while the driver is loaded in the system. + d. To verify that the 'ia' driver is loaded successfully, run the + following command:: + + cat /proc/atm/devices + + If the driver is loaded successfully, the output of the command will + be similar to the following lines:: + + Itf Type ESI/"MAC"addr AAL(TX,err,RX,err,drop) ... + 0 ia xxxxxxxxx 0 ( 0 0 0 0 0 ) 5 ( 0 0 0 0 0 ) + + You can also check the system log file /var/log/messages for messages + related to the ATM driver. + +5. Ia Driver Configuration + +5.1 Configuration of adapter buffers + The (i)Chip boards have 3 different packet RAM size variants: 128K, 512K and + 1M. The RAM size decides the number of buffers and buffer size. The default + size and number of buffers are set as following: + + ========= ======= ====== ====== ====== ====== ====== + Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf + RAM size size size size size cnt cnt + ========= ======= ====== ====== ====== ====== ====== + 128K 64K 64K 10K 10K 6 6 + 512K 256K 256K 10K 10K 25 25 + 1M 512K 512K 10K 10K 51 51 + ========= ======= ====== ====== ====== ====== ====== + + These setting should work well in most environments, but can be + changed by typing the following command:: + + insmod /ia.o IA_RX_BUF= IA_RX_BUF_SZ= \ + IA_TX_BUF= IA_TX_BUF_SZ= + + Where: + + - RX_CNT = number of receive buffers in the range (1-128) + - RX_SIZE = size of receive buffers in the range (48-64K) + - TX_CNT = number of transmit buffers in the range (1-128) + - TX_SIZE = size of transmit buffers in the range (48-64K) + + 1. Transmit and receive buffer size must be a multiple of 4. + 2. Care should be taken so that the memory required for the + transmit and receive buffers is less than or equal to the + total adapter packet memory. + +5.2 Turn on ia debug trace + + When the ia driver is built with the CONFIG_ATM_IA_DEBUG flag, the driver + can provide more debug trace if needed. There is a bit mask variable, + IADebugFlag, which controls the output of the traces. You can find the bit + map of the IADebugFlag in iphase.h. + The debug trace can be turn on through the insmod command line option, for + example, "insmod iphase.o IADebugFlag=0xffffffff" can turn on all the debug + traces together with loading the driver. + +6. Ia Driver Test Using ttcp_atm and PVC + + For the PVC setup, the test machines can either be connected back-to-back or + through a switch. If connected through the switch, the switch must be + configured for the PVC(s). + + a. For UBR test: + + At the test machine intended to receive data, type:: + + ttcp_atm -r -a -s 0.100 + + At the other test machine, type:: + + ttcp_atm -t -a -s 0.100 -n 10000 + + Run "ttcp_atm -h" to display more options of the ttcp_atm tool. + b. For ABR test: + + It is the same as the UBR testing, but with an extra command option:: + + -Pabr:max_pcr= + + where: + + xxx = the maximum peak cell rate, from 170 - 353207. + + This option must be set on both the machines. + + c. For CBR test: + + It is the same as the UBR testing, but with an extra command option:: + + -Pcbr:max_pcr= + + where: + + xxx = the maximum peak cell rate, from 170 - 353207. + + This option may only be set on the transmit machine. + + +Outstanding Issues +================== + + + +Contact Information +------------------- + +:: + + Customer Support: + United States: Telephone: (214) 654-5555 + Fax: (214) 654-5500 + E-Mail: intouch@iphase.com + Europe: Telephone: 33 (0)1 41 15 44 00 + Fax: 33 (0)1 41 15 12 13 + World Wide Web: http://www.iphase.com + Anonymous FTP: ftp.iphase.com diff --git a/Documentation/networking/iphase.txt b/Documentation/networking/iphase.txt deleted file mode 100644 index 670b72f16585..000000000000 --- a/Documentation/networking/iphase.txt +++ /dev/null @@ -1,158 +0,0 @@ - - READ ME FISRT - ATM (i)Chip IA Linux Driver Source --------------------------------------------------------------------------------- - Read This Before You Begin! --------------------------------------------------------------------------------- - -Description ------------ - -This is the README file for the Interphase PCI ATM (i)Chip IA Linux driver -source release. - -The features and limitations of this driver are as follows: - - A single VPI (VPI value of 0) is supported. - - Supports 4K VCs for the server board (with 512K control memory) and 1K - VCs for the client board (with 128K control memory). - - UBR, ABR and CBR service categories are supported. - - Only AAL5 is supported. - - Supports setting of PCR on the VCs. - - Multiple adapters in a system are supported. - - All variants of Interphase ATM PCI (i)Chip adapter cards are supported, - including x575 (OC3, control memory 128K , 512K and packet memory 128K, - 512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See - http://www.iphase.com/ - for details. - - Only x86 platforms are supported. - - SMP is supported. - - -Before You Start ----------------- - - -Installation ------------- - -1. Installing the adapters in the system - To install the ATM adapters in the system, follow the steps below. - a. Login as root. - b. Shut down the system and power off the system. - c. Install one or more ATM adapters in the system. - d. Connect each adapter to a port on an ATM switch. The green 'Link' - LED on the front panel of the adapter will be on if the adapter is - connected to the switch properly when the system is powered up. - e. Power on and boot the system. - -2. [ Removed ] - -3. Rebuild kernel with ABR support - [ a. and b. removed ] - c. Reconfigure the kernel, choose the Interphase ia driver through "make - menuconfig" or "make xconfig". - d. Rebuild the kernel, loadable modules and the atm tools. - e. Install the new built kernel and modules and reboot. - -4. Load the adapter hardware driver (ia driver) if it is built as a module - a. Login as root. - b. Change directory to /lib/modules//atm. - c. Run "insmod suni.o;insmod iphase.o" - The yellow 'status' LED on the front panel of the adapter will blink - while the driver is loaded in the system. - d. To verify that the 'ia' driver is loaded successfully, run the - following command: - - cat /proc/atm/devices - - If the driver is loaded successfully, the output of the command will - be similar to the following lines: - - Itf Type ESI/"MAC"addr AAL(TX,err,RX,err,drop) ... - 0 ia xxxxxxxxx 0 ( 0 0 0 0 0 ) 5 ( 0 0 0 0 0 ) - - You can also check the system log file /var/log/messages for messages - related to the ATM driver. - -5. Ia Driver Configuration - -5.1 Configuration of adapter buffers - The (i)Chip boards have 3 different packet RAM size variants: 128K, 512K and - 1M. The RAM size decides the number of buffers and buffer size. The default - size and number of buffers are set as following: - - Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf - RAM size size size size size cnt cnt - -------- ------ ------ ------ ------ ------ ------ - 128K 64K 64K 10K 10K 6 6 - 512K 256K 256K 10K 10K 25 25 - 1M 512K 512K 10K 10K 51 51 - - These setting should work well in most environments, but can be - changed by typing the following command: - - insmod /ia.o IA_RX_BUF= IA_RX_BUF_SZ= \ - IA_TX_BUF= IA_TX_BUF_SZ= - Where: - RX_CNT = number of receive buffers in the range (1-128) - RX_SIZE = size of receive buffers in the range (48-64K) - TX_CNT = number of transmit buffers in the range (1-128) - TX_SIZE = size of transmit buffers in the range (48-64K) - - 1. Transmit and receive buffer size must be a multiple of 4. - 2. Care should be taken so that the memory required for the - transmit and receive buffers is less than or equal to the - total adapter packet memory. - -5.2 Turn on ia debug trace - - When the ia driver is built with the CONFIG_ATM_IA_DEBUG flag, the driver - can provide more debug trace if needed. There is a bit mask variable, - IADebugFlag, which controls the output of the traces. You can find the bit - map of the IADebugFlag in iphase.h. - The debug trace can be turn on through the insmod command line option, for - example, "insmod iphase.o IADebugFlag=0xffffffff" can turn on all the debug - traces together with loading the driver. - -6. Ia Driver Test Using ttcp_atm and PVC - - For the PVC setup, the test machines can either be connected back-to-back or - through a switch. If connected through the switch, the switch must be - configured for the PVC(s). - - a. For UBR test: - At the test machine intended to receive data, type: - ttcp_atm -r -a -s 0.100 - At the other test machine, type: - ttcp_atm -t -a -s 0.100 -n 10000 - Run "ttcp_atm -h" to display more options of the ttcp_atm tool. - b. For ABR test: - It is the same as the UBR testing, but with an extra command option: - -Pabr:max_pcr= - where: - xxx = the maximum peak cell rate, from 170 - 353207. - This option must be set on both the machines. - c. For CBR test: - It is the same as the UBR testing, but with an extra command option: - -Pcbr:max_pcr= - where: - xxx = the maximum peak cell rate, from 170 - 353207. - This option may only be set on the transmit machine. - - -OUTSTANDING ISSUES ------------------- - - - -Contact Information -------------------- - - Customer Support: - United States: Telephone: (214) 654-5555 - Fax: (214) 654-5500 - E-Mail: intouch@iphase.com - Europe: Telephone: 33 (0)1 41 15 44 00 - Fax: 33 (0)1 41 15 12 13 - World Wide Web: http://www.iphase.com - Anonymous FTP: ftp.iphase.com diff --git a/drivers/atm/Kconfig b/drivers/atm/Kconfig index 4af7cbdcc349..cfb0d16b60ad 100644 --- a/drivers/atm/Kconfig +++ b/drivers/atm/Kconfig @@ -306,7 +306,7 @@ config ATM_IA for more info about the cards. Say Y (or M to compile as a module named iphase) here if you have one of these cards. - See the file for further + See the file for further details. config ATM_IA_DEBUG -- cgit v1.2.3 From 355e656e017c3b42deb57d125d86c4cbd277d6db Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:48 +0200 Subject: docs: networking: convert ipsec.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ipsec.rst | 46 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/ipsec.txt | 38 ------------------------------- 3 files changed, 47 insertions(+), 38 deletions(-) create mode 100644 Documentation/networking/ipsec.rst delete mode 100644 Documentation/networking/ipsec.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 505eaa41ca2b..3efb4608649a 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -68,6 +68,7 @@ Contents: ipddp ip_dynaddr iphase + ipsec .. only:: subproject and html diff --git a/Documentation/networking/ipsec.rst b/Documentation/networking/ipsec.rst new file mode 100644 index 000000000000..afe9d7b48be3 --- /dev/null +++ b/Documentation/networking/ipsec.rst @@ -0,0 +1,46 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===== +IPsec +===== + + +Here documents known IPsec corner cases which need to be keep in mind when +deploy various IPsec configuration in real world production environment. + +1. IPcomp: + Small IP packet won't get compressed at sender, and failed on + policy check on receiver. + +Quote from RFC3173:: + + 2.2. Non-Expansion Policy + + If the total size of a compressed payload and the IPComp header, as + defined in section 3, is not smaller than the size of the original + payload, the IP datagram MUST be sent in the original non-compressed + form. To clarify: If an IP datagram is sent non-compressed, no + + IPComp header is added to the datagram. This policy ensures saving + the decompression processing cycles and avoiding incurring IP + datagram fragmentation when the expanded datagram is larger than the + MTU. + + Small IP datagrams are likely to expand as a result of compression. + Therefore, a numeric threshold should be applied before compression, + where IP datagrams of size smaller than the threshold are sent in the + original form without attempting compression. The numeric threshold + is implementation dependent. + +Current IPComp implementation is indeed by the book, while as in practice +when sending non-compressed packet to the peer (whether or not packet len +is smaller than the threshold or the compressed len is larger than original +packet len), the packet is dropped when checking the policy as this packet +matches the selector but not coming from any XFRM layer, i.e., with no +security path. Such naked packet will not eventually make it to upper layer. +The result is much more wired to the user when ping peer with different +payload length. + +One workaround is try to set "level use" for each policy if user observed +above scenario. The consequence of doing so is small packet(uncompressed) +will skip policy checking on receiver side. diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.txt deleted file mode 100644 index ba794b7e51be..000000000000 --- a/Documentation/networking/ipsec.txt +++ /dev/null @@ -1,38 +0,0 @@ - -Here documents known IPsec corner cases which need to be keep in mind when -deploy various IPsec configuration in real world production environment. - -1. IPcomp: Small IP packet won't get compressed at sender, and failed on - policy check on receiver. - -Quote from RFC3173: -2.2. Non-Expansion Policy - - If the total size of a compressed payload and the IPComp header, as - defined in section 3, is not smaller than the size of the original - payload, the IP datagram MUST be sent in the original non-compressed - form. To clarify: If an IP datagram is sent non-compressed, no - - IPComp header is added to the datagram. This policy ensures saving - the decompression processing cycles and avoiding incurring IP - datagram fragmentation when the expanded datagram is larger than the - MTU. - - Small IP datagrams are likely to expand as a result of compression. - Therefore, a numeric threshold should be applied before compression, - where IP datagrams of size smaller than the threshold are sent in the - original form without attempting compression. The numeric threshold - is implementation dependent. - -Current IPComp implementation is indeed by the book, while as in practice -when sending non-compressed packet to the peer (whether or not packet len -is smaller than the threshold or the compressed len is larger than original -packet len), the packet is dropped when checking the policy as this packet -matches the selector but not coming from any XFRM layer, i.e., with no -security path. Such naked packet will not eventually make it to upper layer. -The result is much more wired to the user when ping peer with different -payload length. - -One workaround is try to set "level use" for each policy if user observed -above scenario. The consequence of doing so is small packet(uncompressed) -will skip policy checking on receiver side. -- cgit v1.2.3 From 1cec2cacaaec5d53adc04dd3ecfdb687b26c0e89 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:49 +0200 Subject: docs: networking: convert ip-sysctl.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - mark lists as such; - mark tables as such; - use footnote markup; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/admin-guide/sysctl/net.rst | 2 +- Documentation/networking/index.rst | 1 + Documentation/networking/ip-sysctl.rst | 2649 +++++++++++++++++++++++ Documentation/networking/ip-sysctl.txt | 2374 -------------------- Documentation/networking/snmp_counter.rst | 2 +- net/Kconfig | 2 +- net/ipv4/Kconfig | 2 +- net/ipv4/icmp.c | 2 +- 9 files changed, 2656 insertions(+), 2380 deletions(-) create mode 100644 Documentation/networking/ip-sysctl.rst delete mode 100644 Documentation/networking/ip-sysctl.txt (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index b23ab11587a6..e37db6f1be64 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4910,7 +4910,7 @@ Set the number of tcp_metrics_hash slots. Default value is 8192 or 16384 depending on total ram pages. This is used to specify the TCP metrics - cache size. See Documentation/networking/ip-sysctl.txt + cache size. See Documentation/networking/ip-sysctl.rst "tcp_no_metrics_save" section for more details. tdfx= [HW,DRM] diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index e043c9213388..84e3348a9543 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -353,7 +353,7 @@ socket's buffer. It will not take effect unless PF_UNIX flag is specified. 3. /proc/sys/net/ipv4 - IPV4 settings ------------------------------------- -Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for +Please see: Documentation/networking/ip-sysctl.rst and ipvs-sysctl.txt for descriptions of these entries. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 3efb4608649a..7d133d8dbe2a 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -69,6 +69,7 @@ Contents: ip_dynaddr iphase ipsec + ip-sysctl .. only:: subproject and html diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst new file mode 100644 index 000000000000..38f811d4b2f0 --- /dev/null +++ b/Documentation/networking/ip-sysctl.rst @@ -0,0 +1,2649 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========= +IP Sysctl +========= + +/proc/sys/net/ipv4/* Variables +============================== + +ip_forward - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + Forward Packets between interfaces. + + This variable is special, its change resets all configuration + parameters to their default state (RFC1122 for hosts, RFC1812 + for routers) + +ip_default_ttl - INTEGER + Default value of TTL field (Time To Live) for outgoing (but not + forwarded) IP packets. Should be between 1 and 255 inclusive. + Default: 64 (as recommended by RFC1700) + +ip_no_pmtu_disc - INTEGER + Disable Path MTU Discovery. If enabled in mode 1 and a + fragmentation-required ICMP is received, the PMTU to this + destination will be set to min_pmtu (see below). You will need + to raise min_pmtu to the smallest interface MTU on your system + manually if you want to avoid locally generated fragments. + + In mode 2 incoming Path MTU Discovery messages will be + discarded. Outgoing frames are handled the same as in mode 1, + implicitly setting IP_PMTUDISC_DONT on every created socket. + + Mode 3 is a hardened pmtu discover mode. The kernel will only + accept fragmentation-needed errors if the underlying protocol + can verify them besides a plain socket lookup. Current + protocols for which pmtu events will be honored are TCP, SCTP + and DCCP as they verify e.g. the sequence number or the + association. This mode should not be enabled globally but is + only intended to secure e.g. name servers in namespaces where + TCP path mtu must still work but path MTU information of other + protocols should be discarded. If enabled globally this mode + could break other protocols. + + Possible values: 0-3 + + Default: FALSE + +min_pmtu - INTEGER + default 552 - minimum discovered Path MTU + +ip_forward_use_pmtu - BOOLEAN + By default we don't trust protocol path MTUs while forwarding + because they could be easily forged and can lead to unwanted + fragmentation by the router. + You only need to enable this if you have user-space software + which tries to discover path mtus by itself and depends on the + kernel honoring this information. This is normally not the + case. + + Default: 0 (disabled) + + Possible values: + + - 0 - disabled + - 1 - enabled + +fwmark_reflect - BOOLEAN + Controls the fwmark of kernel-generated IPv4 reply packets that are not + associated with a socket for example, TCP RSTs or ICMP echo replies). + If unset, these packets have a fwmark of zero. If set, they have the + fwmark of the packet they are replying to. + + Default: 0 + +fib_multipath_use_neigh - BOOLEAN + Use status of existing neighbor entry when determining nexthop for + multipath routes. If disabled, neighbor information is not used and + packets could be directed to a failed nexthop. Only valid for kernels + built with CONFIG_IP_ROUTE_MULTIPATH enabled. + + Default: 0 (disabled) + + Possible values: + + - 0 - disabled + - 1 - enabled + +fib_multipath_hash_policy - INTEGER + Controls which hash policy to use for multipath routes. Only valid + for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled. + + Default: 0 (Layer 3) + + Possible values: + + - 0 - Layer 3 + - 1 - Layer 4 + - 2 - Layer 3 or inner Layer 3 if present + +fib_sync_mem - UNSIGNED INTEGER + Amount of dirty memory from fib entries that can be backlogged before + synchronize_rcu is forced. + + Default: 512kB Minimum: 64kB Maximum: 64MB + +ip_forward_update_priority - INTEGER + Whether to update SKB priority from "TOS" field in IPv4 header after it + is forwarded. The new SKB priority is mapped from TOS field value + according to an rt_tos2priority table (see e.g. man tc-prio). + + Default: 1 (Update priority.) + + Possible values: + + - 0 - Do not update priority. + - 1 - Update priority. + +route/max_size - INTEGER + Maximum number of routes allowed in the kernel. Increase + this when using large numbers of interfaces and/or routes. + + From linux kernel 3.6 onwards, this is deprecated for ipv4 + as route cache is no longer used. + +neigh/default/gc_thresh1 - INTEGER + Minimum number of entries to keep. Garbage collector will not + purge entries if there are fewer than this number. + + Default: 128 + +neigh/default/gc_thresh2 - INTEGER + Threshold when garbage collector becomes more aggressive about + purging entries. Entries older than 5 seconds will be cleared + when over this number. + + Default: 512 + +neigh/default/gc_thresh3 - INTEGER + Maximum number of non-PERMANENT neighbor entries allowed. Increase + this when using large numbers of interfaces and when communicating + with large numbers of directly-connected peers. + + Default: 1024 + +neigh/default/unres_qlen_bytes - INTEGER + The maximum number of bytes which may be used by packets + queued for each unresolved address by other network layers. + (added in linux 3.3) + + Setting negative value is meaningless and will return error. + + Default: SK_WMEM_MAX, (same as net.core.wmem_default). + + Exact value depends on architecture and kernel options, + but should be enough to allow queuing 256 packets + of medium size. + +neigh/default/unres_qlen - INTEGER + The maximum number of packets which may be queued for each + unresolved address by other network layers. + + (deprecated in linux 3.3) : use unres_qlen_bytes instead. + + Prior to linux 3.3, the default value is 3 which may cause + unexpected packet loss. The current default value is calculated + according to default value of unres_qlen_bytes and true size of + packet. + + Default: 101 + +mtu_expires - INTEGER + Time, in seconds, that cached PMTU information is kept. + +min_adv_mss - INTEGER + The advertised MSS depends on the first hop route MTU, but will + never be lower than this setting. + +IP Fragmentation: + +ipfrag_high_thresh - LONG INTEGER + Maximum memory used to reassemble IP fragments. + +ipfrag_low_thresh - LONG INTEGER + (Obsolete since linux-4.17) + Maximum memory used to reassemble IP fragments before the kernel + begins to remove incomplete fragment queues to free up resources. + The kernel still accepts new fragments for defragmentation. + +ipfrag_time - INTEGER + Time in seconds to keep an IP fragment in memory. + +ipfrag_max_dist - INTEGER + ipfrag_max_dist is a non-negative integer value which defines the + maximum "disorder" which is allowed among fragments which share a + common IP source address. Note that reordering of packets is + not unusual, but if a large number of fragments arrive from a source + IP address while a particular fragment queue remains incomplete, it + probably indicates that one or more fragments belonging to that queue + have been lost. When ipfrag_max_dist is positive, an additional check + is done on fragments before they are added to a reassembly queue - if + ipfrag_max_dist (or more) fragments have arrived from a particular IP + address between additions to any IP fragment queue using that source + address, it's presumed that one or more fragments in the queue are + lost. The existing fragment queue will be dropped, and a new one + started. An ipfrag_max_dist value of zero disables this check. + + Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can + result in unnecessarily dropping fragment queues when normal + reordering of packets occurs, which could lead to poor application + performance. Using a very large value, e.g. 50000, increases the + likelihood of incorrectly reassembling IP fragments that originate + from different IP datagrams, which could result in data corruption. + Default: 64 + +INET peer storage +================= + +inet_peer_threshold - INTEGER + The approximate size of the storage. Starting from this threshold + entries will be thrown aggressively. This threshold also determines + entries' time-to-live and time intervals between garbage collection + passes. More entries, less time-to-live, less GC interval. + +inet_peer_minttl - INTEGER + Minimum time-to-live of entries. Should be enough to cover fragment + time-to-live on the reassembling side. This minimum time-to-live is + guaranteed if the pool size is less than inet_peer_threshold. + Measured in seconds. + +inet_peer_maxttl - INTEGER + Maximum time-to-live of entries. Unused entries will expire after + this period of time if there is no memory pressure on the pool (i.e. + when the number of entries in the pool is very small). + Measured in seconds. + +TCP variables +============= + +somaxconn - INTEGER + Limit of socket listen() backlog, known in userspace as SOMAXCONN. + Defaults to 4096. (Was 128 before linux-5.4) + See also tcp_max_syn_backlog for additional tuning for TCP sockets. + +tcp_abort_on_overflow - BOOLEAN + If listening service is too slow to accept new connections, + reset them. Default state is FALSE. It means that if overflow + occurred due to a burst, connection will recover. Enable this + option _only_ if you are really sure that listening daemon + cannot be tuned to accept connections faster. Enabling this + option can harm clients of your server. + +tcp_adv_win_scale - INTEGER + Count buffering overhead as bytes/2^tcp_adv_win_scale + (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), + if it is <= 0. + + Possible values are [-31, 31], inclusive. + + Default: 1 + +tcp_allowed_congestion_control - STRING + Show/set the congestion control choices available to non-privileged + processes. The list is a subset of those listed in + tcp_available_congestion_control. + + Default is "reno" and the default setting (tcp_congestion_control). + +tcp_app_win - INTEGER + Reserve max(window/2^tcp_app_win, mss) of window for application + buffer. Value 0 is special, it means that nothing is reserved. + + Default: 31 + +tcp_autocorking - BOOLEAN + Enable TCP auto corking : + When applications do consecutive small write()/sendmsg() system calls, + we try to coalesce these small writes as much as possible, to lower + total amount of sent packets. This is done if at least one prior + packet for the flow is waiting in Qdisc queues or device transmit + queue. Applications can still use TCP_CORK for optimal behavior + when they know how/when to uncork their sockets. + + Default : 1 + +tcp_available_congestion_control - STRING + Shows the available congestion control choices that are registered. + More congestion control algorithms may be available as modules, + but not loaded. + +tcp_base_mss - INTEGER + The initial value of search_low to be used by the packetization layer + Path MTU discovery (MTU probing). If MTU probing is enabled, + this is the initial MSS used by the connection. + +tcp_mtu_probe_floor - INTEGER + If MTU probing is enabled this caps the minimum MSS used for search_low + for the connection. + + Default : 48 + +tcp_min_snd_mss - INTEGER + TCP SYN and SYNACK messages usually advertise an ADVMSS option, + as described in RFC 1122 and RFC 6691. + + If this ADVMSS option is smaller than tcp_min_snd_mss, + it is silently capped to tcp_min_snd_mss. + + Default : 48 (at least 8 bytes of payload per segment) + +tcp_congestion_control - STRING + Set the congestion control algorithm to be used for new + connections. The algorithm "reno" is always available, but + additional choices may be available based on kernel configuration. + Default is set as part of kernel configuration. + For passive connections, the listener congestion control choice + is inherited. + + [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ] + +tcp_dsack - BOOLEAN + Allows TCP to send "duplicate" SACKs. + +tcp_early_retrans - INTEGER + Tail loss probe (TLP) converts RTOs occurring due to tail + losses into fast recovery (draft-ietf-tcpm-rack). Note that + TLP requires RACK to function properly (see tcp_recovery below) + + Possible values: + + - 0 disables TLP + - 3 or 4 enables TLP + + Default: 3 + +tcp_ecn - INTEGER + Control use of Explicit Congestion Notification (ECN) by TCP. + ECN is used only when both ends of the TCP connection indicate + support for it. This feature is useful in avoiding losses due + to congestion by allowing supporting routers to signal + congestion before having to drop packets. + + Possible values are: + + = ===================================================== + 0 Disable ECN. Neither initiate nor accept ECN. + 1 Enable ECN when requested by incoming connections and + also request ECN on outgoing connection attempts. + 2 Enable ECN when requested by incoming connections + but do not request ECN on outgoing connections. + = ===================================================== + + Default: 2 + +tcp_ecn_fallback - BOOLEAN + If the kernel detects that ECN connection misbehaves, enable fall + back to non-ECN. Currently, this knob implements the fallback + from RFC3168, section 6.1.1.1., but we reserve that in future, + additional detection mechanisms could be implemented under this + knob. The value is not used, if tcp_ecn or per route (or congestion + control) ECN settings are disabled. + + Default: 1 (fallback enabled) + +tcp_fack - BOOLEAN + This is a legacy option, it has no effect anymore. + +tcp_fin_timeout - INTEGER + The length of time an orphaned (no longer referenced by any + application) connection will remain in the FIN_WAIT_2 state + before it is aborted at the local end. While a perfectly + valid "receive only" state for an un-orphaned connection, an + orphaned connection in FIN_WAIT_2 state could otherwise wait + forever for the remote to close its end of the connection. + + Cf. tcp_max_orphans + + Default: 60 seconds + +tcp_frto - INTEGER + Enables Forward RTO-Recovery (F-RTO) defined in RFC5682. + F-RTO is an enhanced recovery algorithm for TCP retransmission + timeouts. It is particularly beneficial in networks where the + RTT fluctuates (e.g., wireless). F-RTO is sender-side only + modification. It does not require any support from the peer. + + By default it's enabled with a non-zero value. 0 disables F-RTO. + +tcp_fwmark_accept - BOOLEAN + If set, incoming connections to listening sockets that do not have a + socket mark will set the mark of the accepting socket to the fwmark of + the incoming SYN packet. This will cause all packets on that connection + (starting from the first SYNACK) to be sent with that fwmark. The + listening socket's mark is unchanged. Listening sockets that already + have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are + unaffected. + + Default: 0 + +tcp_invalid_ratelimit - INTEGER + Limit the maximal rate for sending duplicate acknowledgments + in response to incoming TCP packets that are for an existing + connection but that are invalid due to any of these reasons: + + (a) out-of-window sequence number, + (b) out-of-window acknowledgment number, or + (c) PAWS (Protection Against Wrapped Sequence numbers) check failure + + This can help mitigate simple "ack loop" DoS attacks, wherein + a buggy or malicious middlebox or man-in-the-middle can + rewrite TCP header fields in manner that causes each endpoint + to think that the other is sending invalid TCP segments, thus + causing each side to send an unterminating stream of duplicate + acknowledgments for invalid segments. + + Using 0 disables rate-limiting of dupacks in response to + invalid segments; otherwise this value specifies the minimal + space between sending such dupacks, in milliseconds. + + Default: 500 (milliseconds). + +tcp_keepalive_time - INTEGER + How often TCP sends out keepalive messages when keepalive is enabled. + Default: 2hours. + +tcp_keepalive_probes - INTEGER + How many keepalive probes TCP sends out, until it decides that the + connection is broken. Default value: 9. + +tcp_keepalive_intvl - INTEGER + How frequently the probes are send out. Multiplied by + tcp_keepalive_probes it is time to kill not responding connection, + after probes started. Default value: 75sec i.e. connection + will be aborted after ~11 minutes of retries. + +tcp_l3mdev_accept - BOOLEAN + Enables child sockets to inherit the L3 master device index. + Enabling this option allows a "global" listen socket to work + across L3 master domains (e.g., VRFs) with connected sockets + derived from the listen socket to be bound to the L3 domain in + which the packets originated. Only valid when the kernel was + compiled with CONFIG_NET_L3_MASTER_DEV. + + Default: 0 (disabled) + +tcp_low_latency - BOOLEAN + This is a legacy option, it has no effect anymore. + +tcp_max_orphans - INTEGER + Maximal number of TCP sockets not attached to any user file handle, + held by system. If this number is exceeded orphaned connections are + reset immediately and warning is printed. This limit exists + only to prevent simple DoS attacks, you _must_ not rely on this + or lower the limit artificially, but rather increase it + (probably, after increasing installed memory), + if network conditions require more than default value, + and tune network services to linger and kill such states + more aggressively. Let me to remind again: each orphan eats + up to ~64K of unswappable memory. + +tcp_max_syn_backlog - INTEGER + Maximal number of remembered connection requests (SYN_RECV), + which have not received an acknowledgment from connecting client. + + This is a per-listener limit. + + The minimal value is 128 for low memory machines, and it will + increase in proportion to the memory of machine. + + If server suffers from overload, try increasing this number. + + Remember to also check /proc/sys/net/core/somaxconn + A SYN_RECV request socket consumes about 304 bytes of memory. + +tcp_max_tw_buckets - INTEGER + Maximal number of timewait sockets held by system simultaneously. + If this number is exceeded time-wait socket is immediately destroyed + and warning is printed. This limit exists only to prevent + simple DoS attacks, you _must_ not lower the limit artificially, + but rather increase it (probably, after increasing installed memory), + if network conditions require more than default value. + +tcp_mem - vector of 3 INTEGERs: min, pressure, max + min: below this number of pages TCP is not bothered about its + memory appetite. + + pressure: when amount of memory allocated by TCP exceeds this number + of pages, TCP moderates its memory consumption and enters memory + pressure mode, which is exited when memory consumption falls + under "min". + + max: number of pages allowed for queueing by all TCP sockets. + + Defaults are calculated at boot time from amount of available + memory. + +tcp_min_rtt_wlen - INTEGER + The window length of the windowed min filter to track the minimum RTT. + A shorter window lets a flow more quickly pick up new (higher) + minimum RTT when it is moved to a longer path (e.g., due to traffic + engineering). A longer window makes the filter more resistant to RTT + inflations such as transient congestion. The unit is seconds. + + Possible values: 0 - 86400 (1 day) + + Default: 300 + +tcp_moderate_rcvbuf - BOOLEAN + If set, TCP performs receive buffer auto-tuning, attempting to + automatically size the buffer (no greater than tcp_rmem[2]) to + match the size required by the path for full throughput. Enabled by + default. + +tcp_mtu_probing - INTEGER + Controls TCP Packetization-Layer Path MTU Discovery. Takes three + values: + + - 0 - Disabled + - 1 - Disabled by default, enabled when an ICMP black hole detected + - 2 - Always enabled, use initial MSS of tcp_base_mss. + +tcp_probe_interval - UNSIGNED INTEGER + Controls how often to start TCP Packetization-Layer Path MTU + Discovery reprobe. The default is reprobing every 10 minutes as + per RFC4821. + +tcp_probe_threshold - INTEGER + Controls when TCP Packetization-Layer Path MTU Discovery probing + will stop in respect to the width of search range in bytes. Default + is 8 bytes. + +tcp_no_metrics_save - BOOLEAN + By default, TCP saves various connection metrics in the route cache + when the connection closes, so that connections established in the + near future can use these to set initial conditions. Usually, this + increases overall performance, but may sometimes cause performance + degradation. If set, TCP will not cache metrics on closing + connections. + +tcp_no_ssthresh_metrics_save - BOOLEAN + Controls whether TCP saves ssthresh metrics in the route cache. + + Default is 1, which disables ssthresh metrics. + +tcp_orphan_retries - INTEGER + This value influences the timeout of a locally closed TCP connection, + when RTO retransmissions remain unacknowledged. + See tcp_retries2 for more details. + + The default value is 8. + + If your machine is a loaded WEB server, + you should think about lowering this value, such sockets + may consume significant resources. Cf. tcp_max_orphans. + +tcp_recovery - INTEGER + This value is a bitmap to enable various experimental loss recovery + features. + + ========= ============================================================= + RACK: 0x1 enables the RACK loss detection for fast detection of lost + retransmissions and tail drops. It also subsumes and disables + RFC6675 recovery for SACK connections. + + RACK: 0x2 makes RACK's reordering window static (min_rtt/4). + + RACK: 0x4 disables RACK's DUPACK threshold heuristic + ========= ============================================================= + + Default: 0x1 + +tcp_reordering - INTEGER + Initial reordering level of packets in a TCP stream. + TCP stack can then dynamically adjust flow reordering level + between this initial value and tcp_max_reordering + + Default: 3 + +tcp_max_reordering - INTEGER + Maximal reordering level of packets in a TCP stream. + 300 is a fairly conservative value, but you might increase it + if paths are using per packet load balancing (like bonding rr mode) + + Default: 300 + +tcp_retrans_collapse - BOOLEAN + Bug-to-bug compatibility with some broken printers. + On retransmit try to send bigger packets to work around bugs in + certain TCP stacks. + +tcp_retries1 - INTEGER + This value influences the time, after which TCP decides, that + something is wrong due to unacknowledged RTO retransmissions, + and reports this suspicion to the network layer. + See tcp_retries2 for more details. + + RFC 1122 recommends at least 3 retransmissions, which is the + default. + +tcp_retries2 - INTEGER + This value influences the timeout of an alive TCP connection, + when RTO retransmissions remain unacknowledged. + Given a value of N, a hypothetical TCP connection following + exponential backoff with an initial RTO of TCP_RTO_MIN would + retransmit N times before killing the connection at the (N+1)th RTO. + + The default value of 15 yields a hypothetical timeout of 924.6 + seconds and is a lower bound for the effective timeout. + TCP will effectively time out at the first RTO which exceeds the + hypothetical timeout. + + RFC 1122 recommends at least 100 seconds for the timeout, + which corresponds to a value of at least 8. + +tcp_rfc1337 - BOOLEAN + If set, the TCP stack behaves conforming to RFC1337. If unset, + we are not conforming to RFC, but prevent TCP TIME_WAIT + assassination. + + Default: 0 + +tcp_rmem - vector of 3 INTEGERs: min, default, max + min: Minimal size of receive buffer used by TCP sockets. + It is guaranteed to each TCP socket, even under moderate memory + pressure. + + Default: 4K + + default: initial size of receive buffer used by TCP sockets. + This value overrides net.core.rmem_default used by other protocols. + Default: 87380 bytes. This value results in window of 65535 with + default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit + less for default tcp_app_win. See below about these variables. + + max: maximal size of receive buffer allowed for automatically + selected receiver buffers for TCP socket. This value does not override + net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables + automatic tuning of that socket's receive buffer size, in which + case this value is ignored. + Default: between 87380B and 6MB, depending on RAM size. + +tcp_sack - BOOLEAN + Enable select acknowledgments (SACKS). + +tcp_comp_sack_delay_ns - LONG INTEGER + TCP tries to reduce number of SACK sent, using a timer + based on 5% of SRTT, capped by this sysctl, in nano seconds. + The default is 1ms, based on TSO autosizing period. + + Default : 1,000,000 ns (1 ms) + +tcp_comp_sack_nr - INTEGER + Max number of SACK that can be compressed. + Using 0 disables SACK compression. + + Default : 44 + +tcp_slow_start_after_idle - BOOLEAN + If set, provide RFC2861 behavior and time out the congestion + window after an idle period. An idle period is defined at + the current RTO. If unset, the congestion window will not + be timed out after an idle period. + + Default: 1 + +tcp_stdurg - BOOLEAN + Use the Host requirements interpretation of the TCP urgent pointer field. + Most hosts use the older BSD interpretation, so if you turn this on + Linux might not communicate correctly with them. + + Default: FALSE + +tcp_synack_retries - INTEGER + Number of times SYNACKs for a passive TCP connection attempt will + be retransmitted. Should not be higher than 255. Default value + is 5, which corresponds to 31seconds till the last retransmission + with the current initial RTO of 1second. With this the final timeout + for a passive TCP connection will happen after 63seconds. + +tcp_syncookies - INTEGER + Only valid when the kernel was compiled with CONFIG_SYN_COOKIES + Send out syncookies when the syn backlog queue of a socket + overflows. This is to prevent against the common 'SYN flood attack' + Default: 1 + + Note, that syncookies is fallback facility. + It MUST NOT be used to help highly loaded servers to stand + against legal connection rate. If you see SYN flood warnings + in your logs, but investigation shows that they occur + because of overload with legal connections, you should tune + another parameters until this warning disappear. + See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow. + + syncookies seriously violate TCP protocol, do not allow + to use TCP extensions, can result in serious degradation + of some services (f.e. SMTP relaying), visible not by you, + but your clients and relays, contacting you. While you see + SYN flood warnings in logs not being really flooded, your server + is seriously misconfigured. + + If you want to test which effects syncookies have to your + network connections you can set this knob to 2 to enable + unconditionally generation of syncookies. + +tcp_fastopen - INTEGER + Enable TCP Fast Open (RFC7413) to send and accept data in the opening + SYN packet. + + The client support is enabled by flag 0x1 (on by default). The client + then must use sendmsg() or sendto() with the MSG_FASTOPEN flag, + rather than connect() to send data in SYN. + + The server support is enabled by flag 0x2 (off by default). Then + either enable for all listeners with another flag (0x400) or + enable individual listeners via TCP_FASTOPEN socket option with + the option value being the length of the syn-data backlog. + + The values (bitmap) are + + ===== ======== ====================================================== + 0x1 (client) enables sending data in the opening SYN on the client. + 0x2 (server) enables the server support, i.e., allowing data in + a SYN packet to be accepted and passed to the + application before 3-way handshake finishes. + 0x4 (client) send data in the opening SYN regardless of cookie + availability and without a cookie option. + 0x200 (server) accept data-in-SYN w/o any cookie option present. + 0x400 (server) enable all listeners to support Fast Open by + default without explicit TCP_FASTOPEN socket option. + ===== ======== ====================================================== + + Default: 0x1 + + Note that that additional client or server features are only + effective if the basic support (0x1 and 0x2) are enabled respectively. + +tcp_fastopen_blackhole_timeout_sec - INTEGER + Initial time period in second to disable Fastopen on active TCP sockets + when a TFO firewall blackhole issue happens. + This time period will grow exponentially when more blackhole issues + get detected right after Fastopen is re-enabled and will reset to + initial value when the blackhole issue goes away. + 0 to disable the blackhole detection. + + By default, it is set to 1hr. + +tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs + The list consists of a primary key and an optional backup key. The + primary key is used for both creating and validating cookies, while the + optional backup key is only used for validating cookies. The purpose of + the backup key is to maximize TFO validation when keys are rotated. + + A randomly chosen primary key may be configured by the kernel if + the tcp_fastopen sysctl is set to 0x400 (see above), or if the + TCP_FASTOPEN setsockopt() optname is set and a key has not been + previously configured via sysctl. If keys are configured via + setsockopt() by using the TCP_FASTOPEN_KEY optname, then those + per-socket keys will be used instead of any keys that are specified via + sysctl. + + A key is specified as 4 8-digit hexadecimal integers which are separated + by a '-' as: xxxxxxxx-xxxxxxxx-xxxxxxxx-xxxxxxxx. Leading zeros may be + omitted. A primary and a backup key may be specified by separating them + by a comma. If only one key is specified, it becomes the primary key and + any previously configured backup keys are removed. + +tcp_syn_retries - INTEGER + Number of times initial SYNs for an active TCP connection attempt + will be retransmitted. Should not be higher than 127. Default value + is 6, which corresponds to 63seconds till the last retransmission + with the current initial RTO of 1second. With this the final timeout + for an active TCP connection attempt will happen after 127seconds. + +tcp_timestamps - INTEGER + Enable timestamps as defined in RFC1323. + + - 0: Disabled. + - 1: Enable timestamps as defined in RFC1323 and use random offset for + each connection rather than only using the current time. + - 2: Like 1, but without random offsets. + + Default: 1 + +tcp_min_tso_segs - INTEGER + Minimal number of segments per TSO frame. + + Since linux-3.12, TCP does an automatic sizing of TSO frames, + depending on flow rate, instead of filling 64Kbytes packets. + For specific usages, it's possible to force TCP to build big + TSO frames. Note that TCP stack might split too big TSO packets + if available window is too small. + + Default: 2 + +tcp_pacing_ss_ratio - INTEGER + sk->sk_pacing_rate is set by TCP stack using a ratio applied + to current rate. (current_rate = cwnd * mss / srtt) + If TCP is in slow start, tcp_pacing_ss_ratio is applied + to let TCP probe for bigger speeds, assuming cwnd can be + doubled every other RTT. + + Default: 200 + +tcp_pacing_ca_ratio - INTEGER + sk->sk_pacing_rate is set by TCP stack using a ratio applied + to current rate. (current_rate = cwnd * mss / srtt) + If TCP is in congestion avoidance phase, tcp_pacing_ca_ratio + is applied to conservatively probe for bigger throughput. + + Default: 120 + +tcp_tso_win_divisor - INTEGER + This allows control over what percentage of the congestion window + can be consumed by a single TSO frame. + The setting of this parameter is a choice between burstiness and + building larger TSO frames. + + Default: 3 + +tcp_tw_reuse - INTEGER + Enable reuse of TIME-WAIT sockets for new connections when it is + safe from protocol viewpoint. + + - 0 - disable + - 1 - global enable + - 2 - enable for loopback traffic only + + It should not be changed without advice/request of technical + experts. + + Default: 2 + +tcp_window_scaling - BOOLEAN + Enable window scaling as defined in RFC1323. + +tcp_wmem - vector of 3 INTEGERs: min, default, max + min: Amount of memory reserved for send buffers for TCP sockets. + Each TCP socket has rights to use it due to fact of its birth. + + Default: 4K + + default: initial size of send buffer used by TCP sockets. This + value overrides net.core.wmem_default used by other protocols. + + It is usually lower than net.core.wmem_default. + + Default: 16K + + max: Maximal amount of memory allowed for automatically tuned + send buffers for TCP sockets. This value does not override + net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables + automatic tuning of that socket's send buffer size, in which case + this value is ignored. + + Default: between 64K and 4MB, depending on RAM size. + +tcp_notsent_lowat - UNSIGNED INTEGER + A TCP socket can control the amount of unsent bytes in its write queue, + thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() + reports POLLOUT events if the amount of unsent bytes is below a per + socket value, and if the write queue is not full. sendmsg() will + also not add new buffers if the limit is hit. + + This global variable controls the amount of unsent data for + sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change + to the global variable has immediate effect. + + Default: UINT_MAX (0xFFFFFFFF) + +tcp_workaround_signed_windows - BOOLEAN + If set, assume no receipt of a window scaling option means the + remote TCP is broken and treats the window as a signed quantity. + If unset, assume the remote TCP is not broken even if we do + not receive a window scaling option from them. + + Default: 0 + +tcp_thin_linear_timeouts - BOOLEAN + Enable dynamic triggering of linear timeouts for thin streams. + If set, a check is performed upon retransmission by timeout to + determine if the stream is thin (less than 4 packets in flight). + As long as the stream is found to be thin, up to 6 linear + timeouts may be performed before exponential backoff mode is + initiated. This improves retransmission latency for + non-aggressive thin streams, often found to be time-dependent. + For more information on thin streams, see + Documentation/networking/tcp-thin.txt + + Default: 0 + +tcp_limit_output_bytes - INTEGER + Controls TCP Small Queue limit per tcp socket. + TCP bulk sender tends to increase packets in flight until it + gets losses notifications. With SNDBUF autotuning, this can + result in a large amount of packets queued on the local machine + (e.g.: qdiscs, CPU backlog, or device) hurting latency of other + flows, for typical pfifo_fast qdiscs. tcp_limit_output_bytes + limits the number of bytes on qdisc or device to reduce artificial + RTT/cwnd and reduce bufferbloat. + + Default: 1048576 (16 * 65536) + +tcp_challenge_ack_limit - INTEGER + Limits number of Challenge ACK sent per second, as recommended + in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) + Default: 1000 + +tcp_rx_skb_cache - BOOLEAN + Controls a per TCP socket cache of one skb, that might help + performance of some workloads. This might be dangerous + on systems with a lot of TCP sockets, since it increases + memory usage. + + Default: 0 (disabled) + +UDP variables +============= + +udp_l3mdev_accept - BOOLEAN + Enabling this option allows a "global" bound socket to work + across L3 master domains (e.g., VRFs) with packets capable of + being received regardless of the L3 domain in which they + originated. Only valid when the kernel was compiled with + CONFIG_NET_L3_MASTER_DEV. + + Default: 0 (disabled) + +udp_mem - vector of 3 INTEGERs: min, pressure, max + Number of pages allowed for queueing by all UDP sockets. + + min: Below this number of pages UDP is not bothered about its + memory appetite. When amount of memory allocated by UDP exceeds + this number, UDP starts to moderate memory usage. + + pressure: This value was introduced to follow format of tcp_mem. + + max: Number of pages allowed for queueing by all UDP sockets. + + Default is calculated at boot time from amount of available memory. + +udp_rmem_min - INTEGER + Minimal size of receive buffer used by UDP sockets in moderation. + Each UDP socket is able to use the size for receiving data, even if + total pages of UDP sockets exceed udp_mem pressure. The unit is byte. + + Default: 4K + +udp_wmem_min - INTEGER + Minimal size of send buffer used by UDP sockets in moderation. + Each UDP socket is able to use the size for sending data, even if + total pages of UDP sockets exceed udp_mem pressure. The unit is byte. + + Default: 4K + +RAW variables +============= + +raw_l3mdev_accept - BOOLEAN + Enabling this option allows a "global" bound socket to work + across L3 master domains (e.g., VRFs) with packets capable of + being received regardless of the L3 domain in which they + originated. Only valid when the kernel was compiled with + CONFIG_NET_L3_MASTER_DEV. + + Default: 1 (enabled) + +CIPSOv4 Variables +================= + +cipso_cache_enable - BOOLEAN + If set, enable additions to and lookups from the CIPSO label mapping + cache. If unset, additions are ignored and lookups always result in a + miss. However, regardless of the setting the cache is still + invalidated when required when means you can safely toggle this on and + off and the cache will always be "safe". + + Default: 1 + +cipso_cache_bucket_size - INTEGER + The CIPSO label cache consists of a fixed size hash table with each + hash bucket containing a number of cache entries. This variable limits + the number of entries in each hash bucket; the larger the value the + more CIPSO label mappings that can be cached. When the number of + entries in a given hash bucket reaches this limit adding new entries + causes the oldest entry in the bucket to be removed to make room. + + Default: 10 + +cipso_rbm_optfmt - BOOLEAN + Enable the "Optimized Tag 1 Format" as defined in section 3.4.2.6 of + the CIPSO draft specification (see Documentation/netlabel for details). + This means that when set the CIPSO tag will be padded with empty + categories in order to make the packet data 32-bit aligned. + + Default: 0 + +cipso_rbm_structvalid - BOOLEAN + If set, do a very strict check of the CIPSO option when + ip_options_compile() is called. If unset, relax the checks done during + ip_options_compile(). Either way is "safe" as errors are caught else + where in the CIPSO processing code but setting this to 0 (False) should + result in less work (i.e. it should be faster) but could cause problems + with other implementations that require strict checking. + + Default: 0 + +IP Variables +============ + +ip_local_port_range - 2 INTEGERS + Defines the local port range that is used by TCP and UDP to + choose the local port. The first number is the first, the + second the last local port number. + If possible, it is better these numbers have different parity + (one even and one odd value). + Must be greater than or equal to ip_unprivileged_port_start. + The default values are 32768 and 60999 respectively. + +ip_local_reserved_ports - list of comma separated ranges + Specify the ports which are reserved for known third-party + applications. These ports will not be used by automatic port + assignments (e.g. when calling connect() or bind() with port + number 0). Explicit port allocation behavior is unchanged. + + The format used for both input and output is a comma separated + list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and + 10). Writing to the file will clear all previously reserved + ports and update the current list with the one given in the + input. + + Note that ip_local_port_range and ip_local_reserved_ports + settings are independent and both are considered by the kernel + when determining which ports are available for automatic port + assignments. + + You can reserve ports which are not in the current + ip_local_port_range, e.g.:: + + $ cat /proc/sys/net/ipv4/ip_local_port_range + 32000 60999 + $ cat /proc/sys/net/ipv4/ip_local_reserved_ports + 8080,9148 + + although this is redundant. However such a setting is useful + if later the port range is changed to a value that will + include the reserved ports. + + Default: Empty + +ip_unprivileged_port_start - INTEGER + This is a per-namespace sysctl. It defines the first + unprivileged port in the network namespace. Privileged ports + require root or CAP_NET_BIND_SERVICE in order to bind to them. + To disable all privileged ports, set this to 0. They must not + overlap with the ip_local_port_range. + + Default: 1024 + +ip_nonlocal_bind - BOOLEAN + If set, allows processes to bind() to non-local IP addresses, + which can be quite useful - but may break some applications. + + Default: 0 + +ip_autobind_reuse - BOOLEAN + By default, bind() does not select the ports automatically even if + the new socket and all sockets bound to the port have SO_REUSEADDR. + ip_autobind_reuse allows bind() to reuse the port and this is useful + when you use bind()+connect(), but may break some applications. + The preferred solution is to use IP_BIND_ADDRESS_NO_PORT and this + option should only be set by experts. + Default: 0 + +ip_dynaddr - BOOLEAN + If set non-zero, enables support for dynamic addresses. + If set to a non-zero value larger than 1, a kernel log + message will be printed when dynamic address rewriting + occurs. + + Default: 0 + +ip_early_demux - BOOLEAN + Optimize input packet processing down to one demux for + certain kinds of local sockets. Currently we only do this + for established TCP and connected UDP sockets. + + It may add an additional cost for pure routing workloads that + reduces overall throughput, in such case you should disable it. + + Default: 1 + +ping_group_range - 2 INTEGERS + Restrict ICMP_PROTO datagram sockets to users in the group range. + The default is "1 0", meaning, that nobody (not even root) may + create ping sockets. Setting it to "100 100" would grant permissions + to the single group. "0 4294967295" would enable it for the world, "100 + 4294967295" would enable it for the users, but not daemons. + +tcp_early_demux - BOOLEAN + Enable early demux for established TCP sockets. + + Default: 1 + +udp_early_demux - BOOLEAN + Enable early demux for connected UDP sockets. Disable this if + your system could experience more unconnected load. + + Default: 1 + +icmp_echo_ignore_all - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it. + + Default: 0 + +icmp_echo_ignore_broadcasts - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO and + TIMESTAMP requests sent to it via broadcast/multicast. + + Default: 1 + +icmp_ratelimit - INTEGER + Limit the maximal rates for sending ICMP packets whose type matches + icmp_ratemask (see below) to specific targets. + 0 to disable any limiting, + otherwise the minimal space between responses in milliseconds. + Note that another sysctl, icmp_msgs_per_sec limits the number + of ICMP packets sent on all targets. + + Default: 1000 + +icmp_msgs_per_sec - INTEGER + Limit maximal number of ICMP packets sent per second from this host. + Only messages whose type matches icmp_ratemask (see below) are + controlled by this limit. + + Default: 1000 + +icmp_msgs_burst - INTEGER + icmp_msgs_per_sec controls number of ICMP packets sent per second, + while icmp_msgs_burst controls the burst size of these packets. + + Default: 50 + +icmp_ratemask - INTEGER + Mask made of ICMP types for which rates are being limited. + + Significant bits: IHGFEDCBA9876543210 + + Default mask: 0000001100000011000 (6168) + + Bit definitions (see include/linux/icmp.h): + + = ========================= + 0 Echo Reply + 3 Destination Unreachable [1]_ + 4 Source Quench [1]_ + 5 Redirect + 8 Echo Request + B Time Exceeded [1]_ + C Parameter Problem [1]_ + D Timestamp Request + E Timestamp Reply + F Info Request + G Info Reply + H Address Mask Request + I Address Mask Reply + = ========================= + + .. [1] These are rate limited by default (see default mask above) + +icmp_ignore_bogus_error_responses - BOOLEAN + Some routers violate RFC1122 by sending bogus responses to broadcast + frames. Such violations are normally logged via a kernel warning. + If this is set to TRUE, the kernel will not give such warnings, which + will avoid log file clutter. + + Default: 1 + +icmp_errors_use_inbound_ifaddr - BOOLEAN + + If zero, icmp error messages are sent with the primary address of + the exiting interface. + + If non-zero, the message will be sent with the primary address of + the interface that received the packet that caused the icmp error. + This is the behaviour network many administrators will expect from + a router. And it can make debugging complicated network layouts + much easier. + + Note that if no primary address exists for the interface selected, + then the primary address of the first non-loopback interface that + has one will be used regardless of this setting. + + Default: 0 + +igmp_max_memberships - INTEGER + Change the maximum number of multicast groups we can subscribe to. + Default: 20 + + Theoretical maximum value is bounded by having to send a membership + report in a single datagram (i.e. the report can't span multiple + datagrams, or risk confusing the switch and leaving groups you don't + intend to). + + The number of supported groups 'M' is bounded by the number of group + report entries you can fit into a single datagram of 65535 bytes. + + M = 65536-sizeof (ip header)/(sizeof(Group record)) + + Group records are variable length, with a minimum of 12 bytes. + So net.ipv4.igmp_max_memberships should not be set higher than: + + (65536-24) / 12 = 5459 + + The value 5459 assumes no IP header options, so in practice + this number may be lower. + +igmp_max_msf - INTEGER + Maximum number of addresses allowed in the source filter list for a + multicast group. + + Default: 10 + +igmp_qrv - INTEGER + Controls the IGMP query robustness variable (see RFC2236 8.1). + + Default: 2 (as specified by RFC2236 8.1) + + Minimum: 1 (as specified by RFC6636 4.5) + +force_igmp_version - INTEGER + - 0 - (default) No enforcement of a IGMP version, IGMPv1/v2 fallback + allowed. Will back to IGMPv3 mode again if all IGMPv1/v2 Querier + Present timer expires. + - 1 - Enforce to use IGMP version 1. Will also reply IGMPv1 report if + receive IGMPv2/v3 query. + - 2 - Enforce to use IGMP version 2. Will fallback to IGMPv1 if receive + IGMPv1 query message. Will reply report if receive IGMPv3 query. + - 3 - Enforce to use IGMP version 3. The same react with default 0. + + .. note:: + + this is not the same with force_mld_version because IGMPv3 RFC3376 + Security Considerations does not have clear description that we could + ignore other version messages completely as MLDv2 RFC3810. So make + this value as default 0 is recommended. + +``conf/interface/*`` + changes special settings per interface (where + interface" is the name of your network interface) + +``conf/all/*`` + is special, changes the settings for all interfaces + +log_martians - BOOLEAN + Log packets with impossible addresses to kernel log. + log_martians for the interface will be enabled if at least one of + conf/{all,interface}/log_martians is set to TRUE, + it will be disabled otherwise + +accept_redirects - BOOLEAN + Accept ICMP redirect messages. + accept_redirects for the interface will be enabled if: + + - both conf/{all,interface}/accept_redirects are TRUE in the case + forwarding for the interface is enabled + + or + + - at least one of conf/{all,interface}/accept_redirects is TRUE in the + case forwarding for the interface is disabled + + accept_redirects for the interface will be disabled otherwise + + default: + + - TRUE (host) + - FALSE (router) + +forwarding - BOOLEAN + Enable IP forwarding on this interface. This controls whether packets + received _on_ this interface can be forwarded. + +mc_forwarding - BOOLEAN + Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE + and a multicast routing daemon is required. + conf/all/mc_forwarding must also be set to TRUE to enable multicast + routing for the interface + +medium_id - INTEGER + Integer value used to differentiate the devices by the medium they + are attached to. Two devices can have different id values when + the broadcast packets are received only on one of them. + The default value 0 means that the device is the only interface + to its medium, value of -1 means that medium is not known. + + Currently, it is used to change the proxy_arp behavior: + the proxy_arp feature is enabled for packets forwarded between + two devices attached to different media. + +proxy_arp - BOOLEAN + Do proxy arp. + + proxy_arp for the interface will be enabled if at least one of + conf/{all,interface}/proxy_arp is set to TRUE, + it will be disabled otherwise + +proxy_arp_pvlan - BOOLEAN + Private VLAN proxy arp. + + Basically allow proxy arp replies back to the same interface + (from which the ARP request/solicitation was received). + + This is done to support (ethernet) switch features, like RFC + 3069, where the individual ports are NOT allowed to + communicate with each other, but they are allowed to talk to + the upstream router. As described in RFC 3069, it is possible + to allow these hosts to communicate through the upstream + router by proxy_arp'ing. Don't need to be used together with + proxy_arp. + + This technology is known by different names: + + In RFC 3069 it is called VLAN Aggregation. + Cisco and Allied Telesyn call it Private VLAN. + Hewlett-Packard call it Source-Port filtering or port-isolation. + Ericsson call it MAC-Forced Forwarding (RFC Draft). + +shared_media - BOOLEAN + Send(router) or accept(host) RFC1620 shared media redirects. + Overrides secure_redirects. + + shared_media for the interface will be enabled if at least one of + conf/{all,interface}/shared_media is set to TRUE, + it will be disabled otherwise + + default TRUE + +secure_redirects - BOOLEAN + Accept ICMP redirect messages only to gateways listed in the + interface's current gateway list. Even if disabled, RFC1122 redirect + rules still apply. + + Overridden by shared_media. + + secure_redirects for the interface will be enabled if at least one of + conf/{all,interface}/secure_redirects is set to TRUE, + it will be disabled otherwise + + default TRUE + +send_redirects - BOOLEAN + Send redirects, if router. + + send_redirects for the interface will be enabled if at least one of + conf/{all,interface}/send_redirects is set to TRUE, + it will be disabled otherwise + + Default: TRUE + +bootp_relay - BOOLEAN + Accept packets with source address 0.b.c.d destined + not to this host as local ones. It is supposed, that + BOOTP relay daemon will catch and forward such packets. + conf/all/bootp_relay must also be set to TRUE to enable BOOTP relay + for the interface + + default FALSE + + Not Implemented Yet. + +accept_source_route - BOOLEAN + Accept packets with SRR option. + conf/all/accept_source_route must also be set to TRUE to accept packets + with SRR option on the interface + + default + + - TRUE (router) + - FALSE (host) + +accept_local - BOOLEAN + Accept packets with local source addresses. In combination with + suitable routing, this can be used to direct packets between two + local interfaces over the wire and have them accepted properly. + default FALSE + +route_localnet - BOOLEAN + Do not consider loopback addresses as martian source or destination + while routing. This enables the use of 127/8 for local routing purposes. + + default FALSE + +rp_filter - INTEGER + - 0 - No source validation. + - 1 - Strict mode as defined in RFC3704 Strict Reverse Path + Each incoming packet is tested against the FIB and if the interface + is not the best reverse path the packet check will fail. + By default failed packets are discarded. + - 2 - Loose mode as defined in RFC3704 Loose Reverse Path + Each incoming packet's source address is also tested against the FIB + and if the source address is not reachable via any interface + the packet check will fail. + + Current recommended practice in RFC3704 is to enable strict mode + to prevent IP spoofing from DDos attacks. If using asymmetric routing + or other complicated routing, then loose mode is recommended. + + The max value from conf/{all,interface}/rp_filter is used + when doing source validation on the {interface}. + + Default value is 0. Note that some distributions enable it + in startup scripts. + +arp_filter - BOOLEAN + - 1 - Allows you to have multiple network interfaces on the same + subnet, and have the ARPs for each interface be answered + based on whether or not the kernel would route a packet from + the ARP'd IP out that interface (therefore you must use source + based routing for this to work). In other words it allows control + of which cards (usually 1) will respond to an arp request. + + - 0 - (default) The kernel can respond to arp requests with addresses + from other interfaces. This may seem wrong but it usually makes + sense, because it increases the chance of successful communication. + IP addresses are owned by the complete host on Linux, not by + particular interfaces. Only for more complex setups like load- + balancing, does this behaviour cause problems. + + arp_filter for the interface will be enabled if at least one of + conf/{all,interface}/arp_filter is set to TRUE, + it will be disabled otherwise + +arp_announce - INTEGER + Define different restriction levels for announcing the local + source IP address from IP packets in ARP requests sent on + interface: + + - 0 - (default) Use any local address, configured on any interface + - 1 - Try to avoid local addresses that are not in the target's + subnet for this interface. This mode is useful when target + hosts reachable via this interface require the source IP + address in ARP requests to be part of their logical network + configured on the receiving interface. When we generate the + request we will check all our subnets that include the + target IP and will preserve the source address if it is from + such subnet. If there is no such subnet we select source + address according to the rules for level 2. + - 2 - Always use the best local address for this target. + In this mode we ignore the source address in the IP packet + and try to select local address that we prefer for talks with + the target host. Such local address is selected by looking + for primary IP addresses on all our subnets on the outgoing + interface that include the target IP address. If no suitable + local address is found we select the first local address + we have on the outgoing interface or on all other interfaces, + with the hope we will receive reply for our request and + even sometimes no matter the source IP address we announce. + + The max value from conf/{all,interface}/arp_announce is used. + + Increasing the restriction level gives more chance for + receiving answer from the resolved target while decreasing + the level announces more valid sender's information. + +arp_ignore - INTEGER + Define different modes for sending replies in response to + received ARP requests that resolve local target IP addresses: + + - 0 - (default): reply for any local target IP address, configured + on any interface + - 1 - reply only if the target IP address is local address + configured on the incoming interface + - 2 - reply only if the target IP address is local address + configured on the incoming interface and both with the + sender's IP address are part from same subnet on this interface + - 3 - do not reply for local addresses configured with scope host, + only resolutions for global and link addresses are replied + - 4-7 - reserved + - 8 - do not reply for all local addresses + + The max value from conf/{all,interface}/arp_ignore is used + when ARP request is received on the {interface} + +arp_notify - BOOLEAN + Define mode for notification of address and device changes. + + == ========================================================== + 0 (default): do nothing + 1 Generate gratuitous arp requests when device is brought up + or hardware address changes. + == ========================================================== + +arp_accept - BOOLEAN + Define behavior for gratuitous ARP frames who's IP is not + already present in the ARP table: + + - 0 - don't create new entries in the ARP table + - 1 - create new entries in the ARP table + + Both replies and requests type gratuitous arp will trigger the + ARP table to be updated, if this setting is on. + + If the ARP table already contains the IP address of the + gratuitous arp frame, the arp table will be updated regardless + if this setting is on or off. + +mcast_solicit - INTEGER + The maximum number of multicast probes in INCOMPLETE state, + when the associated hardware address is unknown. Defaults + to 3. + +ucast_solicit - INTEGER + The maximum number of unicast probes in PROBE state, when + the hardware address is being reconfirmed. Defaults to 3. + +app_solicit - INTEGER + The maximum number of probes to send to the user space ARP daemon + via netlink before dropping back to multicast probes (see + mcast_resolicit). Defaults to 0. + +mcast_resolicit - INTEGER + The maximum number of multicast probes after unicast and + app probes in PROBE state. Defaults to 0. + +disable_policy - BOOLEAN + Disable IPSEC policy (SPD) for this interface + +disable_xfrm - BOOLEAN + Disable IPSEC encryption on this interface, whatever the policy + +igmpv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv1 or IGMPv2 report retransmit will take place. + + Default: 10000 (10 seconds) + +igmpv3_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv3 report retransmit will take place. + + Default: 1000 (1 seconds) + +promote_secondaries - BOOLEAN + When a primary IP address is removed from this interface + promote a corresponding secondary IP address instead of + removing all the corresponding secondary IP addresses. + +drop_unicast_in_l2_multicast - BOOLEAN + Drop any unicast IP packets that are received in link-layer + multicast (or broadcast) frames. + + This behavior (for multicast) is actually a SHOULD in RFC + 1122, but is disabled by default for compatibility reasons. + + Default: off (0) + +drop_gratuitous_arp - BOOLEAN + Drop all gratuitous ARP frames, for example if there's a known + good ARP proxy on the network and such frames need not be used + (or in the case of 802.11, must not be used to prevent attacks.) + + Default: off (0) + + +tag - INTEGER + Allows you to write a number, which can be used as required. + + Default value is 0. + +xfrm4_gc_thresh - INTEGER + (Obsolete since linux-4.14) + The threshold at which we will start garbage collecting for IPv4 + destination cache entries. At twice this value the system will + refuse new allocations. + +igmp_link_local_mcast_reports - BOOLEAN + Enable IGMP reports for link local multicast groups in the + 224.0.0.X range. + + Default TRUE + +Alexey Kuznetsov. +kuznet@ms2.inr.ac.ru + +Updated by: + +- Andi Kleen + ak@muc.de +- Nicolas Delon + delon.nicolas@wanadoo.fr + + + + +/proc/sys/net/ipv6/* Variables +============================== + +IPv6 has no global variables such as tcp_*. tcp_* settings under ipv4/ also +apply to IPv6 [XXX?]. + +bindv6only - BOOLEAN + Default value for IPV6_V6ONLY socket option, + which restricts use of the IPv6 socket to IPv6 communication + only. + + - TRUE: disable IPv4-mapped address feature + - FALSE: enable IPv4-mapped address feature + + Default: FALSE (as specified in RFC3493) + +flowlabel_consistency - BOOLEAN + Protect the consistency (and unicity) of flow label. + You have to disable it to use IPV6_FL_F_REFLECT flag on the + flow label manager. + + - TRUE: enabled + - FALSE: disabled + + Default: TRUE + +auto_flowlabels - INTEGER + Automatically generate flow labels based on a flow hash of the + packet. This allows intermediate devices, such as routers, to + identify packet flows for mechanisms like Equal Cost Multipath + Routing (see RFC 6438). + + = =========================================================== + 0 automatic flow labels are completely disabled + 1 automatic flow labels are enabled by default, they can be + disabled on a per socket basis using the IPV6_AUTOFLOWLABEL + socket option + 2 automatic flow labels are allowed, they may be enabled on a + per socket basis using the IPV6_AUTOFLOWLABEL socket option + 3 automatic flow labels are enabled and enforced, they cannot + be disabled by the socket option + = =========================================================== + + Default: 1 + +flowlabel_state_ranges - BOOLEAN + Split the flow label number space into two ranges. 0-0x7FFFF is + reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF + is reserved for stateless flow labels as described in RFC6437. + + - TRUE: enabled + - FALSE: disabled + + Default: true + +flowlabel_reflect - INTEGER + Control flow label reflection. Needed for Path MTU + Discovery to work with Equal Cost Multipath Routing in anycast + environments. See RFC 7690 and: + https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 + + This is a bitmask. + + - 1: enabled for established flows + + Note that this prevents automatic flowlabel changes, as done + in "tcp: change IPv6 flow-label upon receiving spurious retransmission" + and "tcp: Change txhash on every SYN and RTO retransmit" + + - 2: enabled for TCP RESET packets (no active listener) + If set, a RST packet sent in response to a SYN packet on a closed + port will reflect the incoming flow label. + + - 4: enabled for ICMPv6 echo reply messages. + + Default: 0 + +fib_multipath_hash_policy - INTEGER + Controls which hash policy to use for multipath routes. + + Default: 0 (Layer 3) + + Possible values: + + - 0 - Layer 3 (source and destination addresses plus flow label) + - 1 - Layer 4 (standard 5-tuple) + - 2 - Layer 3 or inner Layer 3 if present + +anycast_src_echo_reply - BOOLEAN + Controls the use of anycast addresses as source addresses for ICMPv6 + echo reply + + - TRUE: enabled + - FALSE: disabled + + Default: FALSE + +idgen_delay - INTEGER + Controls the delay in seconds after which time to retry + privacy stable address generation if a DAD conflict is + detected. + + Default: 1 (as specified in RFC7217) + +idgen_retries - INTEGER + Controls the number of retries to generate a stable privacy + address if a DAD conflict is detected. + + Default: 3 (as specified in RFC7217) + +mld_qrv - INTEGER + Controls the MLD query robustness variable (see RFC3810 9.1). + + Default: 2 (as specified by RFC3810 9.1) + + Minimum: 1 (as specified by RFC6636 4.5) + +max_dst_opts_number - INTEGER + Maximum number of non-padding TLVs allowed in a Destination + options extension header. If this value is less than zero + then unknown options are disallowed and the number of known + TLVs allowed is the absolute value of this number. + + Default: 8 + +max_hbh_opts_number - INTEGER + Maximum number of non-padding TLVs allowed in a Hop-by-Hop + options extension header. If this value is less than zero + then unknown options are disallowed and the number of known + TLVs allowed is the absolute value of this number. + + Default: 8 + +max_dst_opts_length - INTEGER + Maximum length allowed for a Destination options extension + header. + + Default: INT_MAX (unlimited) + +max_hbh_length - INTEGER + Maximum length allowed for a Hop-by-Hop options extension + header. + + Default: INT_MAX (unlimited) + +skip_notify_on_dev_down - BOOLEAN + Controls whether an RTM_DELROUTE message is generated for routes + removed when a device is taken down or deleted. IPv4 does not + generate this message; IPv6 does by default. Setting this sysctl + to true skips the message, making IPv4 and IPv6 on par in relying + on userspace caches to track link events and evict routes. + + Default: false (generate message) + +nexthop_compat_mode - BOOLEAN + New nexthop API provides a means for managing nexthops independent of + prefixes. Backwards compatibilty with old route format is enabled by + default which means route dumps and notifications contain the new + nexthop attribute but also the full, expanded nexthop definition. + Further, updates or deletes of a nexthop configuration generate route + notifications for each fib entry using the nexthop. Once a system + understands the new API, this sysctl can be disabled to achieve full + performance benefits of the new API by disabling the nexthop expansion + and extraneous notifications. + Default: true (backward compat mode) + +IPv6 Fragmentation: + +ip6frag_high_thresh - INTEGER + Maximum memory used to reassemble IPv6 fragments. When + ip6frag_high_thresh bytes of memory is allocated for this purpose, + the fragment handler will toss packets until ip6frag_low_thresh + is reached. + +ip6frag_low_thresh - INTEGER + See ip6frag_high_thresh + +ip6frag_time - INTEGER + Time in seconds to keep an IPv6 fragment in memory. + +IPv6 Segment Routing: + +seg6_flowlabel - INTEGER + Controls the behaviour of computing the flowlabel of outer + IPv6 header in case of SR T.encaps + + == ======================================================= + -1 set flowlabel to zero. + 0 copy flowlabel from Inner packet in case of Inner IPv6 + (Set flowlabel to 0 in case IPv4/L2) + 1 Compute the flowlabel using seg6_make_flowlabel() + == ======================================================= + + Default is 0. + +``conf/default/*``: + Change the interface-specific default settings. + + +``conf/all/*``: + Change all the interface-specific settings. + + [XXX: Other special features than forwarding?] + +conf/all/forwarding - BOOLEAN + Enable global IPv6 forwarding between all interfaces. + + IPv4 and IPv6 work differently here; e.g. netfilter must be used + to control which interfaces may forward packets and which not. + + This also sets all interfaces' Host/Router setting + 'forwarding' to the specified value. See below for details. + + This referred to as global forwarding. + +proxy_ndp - BOOLEAN + Do proxy ndp. + +fwmark_reflect - BOOLEAN + Controls the fwmark of kernel-generated IPv6 reply packets that are not + associated with a socket for example, TCP RSTs or ICMPv6 echo replies). + If unset, these packets have a fwmark of zero. If set, they have the + fwmark of the packet they are replying to. + + Default: 0 + +``conf/interface/*``: + Change special settings per interface. + + The functional behaviour for certain settings is different + depending on whether local forwarding is enabled or not. + +accept_ra - INTEGER + Accept Router Advertisements; autoconfigure using them. + + It also determines whether or not to transmit Router + Solicitations. If and only if the functional setting is to + accept Router Advertisements, Router Solicitations will be + transmitted. + + Possible values are: + + == =========================================================== + 0 Do not accept Router Advertisements. + 1 Accept Router Advertisements if forwarding is disabled. + 2 Overrule forwarding behaviour. Accept Router Advertisements + even if forwarding is enabled. + == =========================================================== + + Functional default: + + - enabled if local forwarding is disabled. + - disabled if local forwarding is enabled. + +accept_ra_defrtr - BOOLEAN + Learn default router in Router Advertisement. + + Functional default: + + - enabled if accept_ra is enabled. + - disabled if accept_ra is disabled. + +accept_ra_from_local - BOOLEAN + Accept RA with source-address that is found on local machine + if the RA is otherwise proper and able to be accepted. + + Default is to NOT accept these as it may be an un-intended + network loop. + + Functional default: + + - enabled if accept_ra_from_local is enabled + on a specific interface. + - disabled if accept_ra_from_local is disabled + on a specific interface. + +accept_ra_min_hop_limit - INTEGER + Minimum hop limit Information in Router Advertisement. + + Hop limit Information in Router Advertisement less than this + variable shall be ignored. + + Default: 1 + +accept_ra_pinfo - BOOLEAN + Learn Prefix Information in Router Advertisement. + + Functional default: + + - enabled if accept_ra is enabled. + - disabled if accept_ra is disabled. + +accept_ra_rt_info_min_plen - INTEGER + Minimum prefix length of Route Information in RA. + + Route Information w/ prefix smaller than this variable shall + be ignored. + + Functional default: + + * 0 if accept_ra_rtr_pref is enabled. + * -1 if accept_ra_rtr_pref is disabled. + +accept_ra_rt_info_max_plen - INTEGER + Maximum prefix length of Route Information in RA. + + Route Information w/ prefix larger than this variable shall + be ignored. + + Functional default: + + * 0 if accept_ra_rtr_pref is enabled. + * -1 if accept_ra_rtr_pref is disabled. + +accept_ra_rtr_pref - BOOLEAN + Accept Router Preference in RA. + + Functional default: + + - enabled if accept_ra is enabled. + - disabled if accept_ra is disabled. + +accept_ra_mtu - BOOLEAN + Apply the MTU value specified in RA option 5 (RFC4861). If + disabled, the MTU specified in the RA will be ignored. + + Functional default: + + - enabled if accept_ra is enabled. + - disabled if accept_ra is disabled. + +accept_redirects - BOOLEAN + Accept Redirects. + + Functional default: + + - enabled if local forwarding is disabled. + - disabled if local forwarding is enabled. + +accept_source_route - INTEGER + Accept source routing (routing extension header). + + - >= 0: Accept only routing header type 2. + - < 0: Do not accept routing header. + + Default: 0 + +autoconf - BOOLEAN + Autoconfigure addresses using Prefix Information in Router + Advertisements. + + Functional default: + + - enabled if accept_ra_pinfo is enabled. + - disabled if accept_ra_pinfo is disabled. + +dad_transmits - INTEGER + The amount of Duplicate Address Detection probes to send. + + Default: 1 + +forwarding - INTEGER + Configure interface-specific Host/Router behaviour. + + .. note:: + + It is recommended to have the same setting on all + interfaces; mixed router/host scenarios are rather uncommon. + + Possible values are: + + - 0 Forwarding disabled + - 1 Forwarding enabled + + **FALSE (0)**: + + By default, Host behaviour is assumed. This means: + + 1. IsRouter flag is not set in Neighbour Advertisements. + 2. If accept_ra is TRUE (default), transmit Router + Solicitations. + 3. If accept_ra is TRUE (default), accept Router + Advertisements (and do autoconfiguration). + 4. If accept_redirects is TRUE (default), accept Redirects. + + **TRUE (1)**: + + If local forwarding is enabled, Router behaviour is assumed. + This means exactly the reverse from the above: + + 1. IsRouter flag is set in Neighbour Advertisements. + 2. Router Solicitations are not sent unless accept_ra is 2. + 3. Router Advertisements are ignored unless accept_ra is 2. + 4. Redirects are ignored. + + Default: 0 (disabled) if global forwarding is disabled (default), + otherwise 1 (enabled). + +hop_limit - INTEGER + Default Hop Limit to set. + + Default: 64 + +mtu - INTEGER + Default Maximum Transfer Unit + + Default: 1280 (IPv6 required minimum) + +ip_nonlocal_bind - BOOLEAN + If set, allows processes to bind() to non-local IPv6 addresses, + which can be quite useful - but may break some applications. + + Default: 0 + +router_probe_interval - INTEGER + Minimum interval (in seconds) between Router Probing described + in RFC4191. + + Default: 60 + +router_solicitation_delay - INTEGER + Number of seconds to wait after interface is brought up + before sending Router Solicitations. + + Default: 1 + +router_solicitation_interval - INTEGER + Number of seconds to wait between Router Solicitations. + + Default: 4 + +router_solicitations - INTEGER + Number of Router Solicitations to send until assuming no + routers are present. + + Default: 3 + +use_oif_addrs_only - BOOLEAN + When enabled, the candidate source addresses for destinations + routed via this interface are restricted to the set of addresses + configured on this interface (vis. RFC 6724, section 4). + + Default: false + +use_tempaddr - INTEGER + Preference for Privacy Extensions (RFC3041). + + * <= 0 : disable Privacy Extensions + * == 1 : enable Privacy Extensions, but prefer public + addresses over temporary addresses. + * > 1 : enable Privacy Extensions and prefer temporary + addresses over public addresses. + + Default: + + * 0 (for most devices) + * -1 (for point-to-point devices and loopback devices) + +temp_valid_lft - INTEGER + valid lifetime (in seconds) for temporary addresses. + + Default: 604800 (7 days) + +temp_prefered_lft - INTEGER + Preferred lifetime (in seconds) for temporary addresses. + + Default: 86400 (1 day) + +keep_addr_on_down - INTEGER + Keep all IPv6 addresses on an interface down event. If set static + global addresses with no expiration time are not flushed. + + * >0 : enabled + * 0 : system default + * <0 : disabled + + Default: 0 (addresses are removed) + +max_desync_factor - INTEGER + Maximum value for DESYNC_FACTOR, which is a random value + that ensures that clients don't synchronize with each + other and generate new addresses at exactly the same time. + value is in seconds. + + Default: 600 + +regen_max_retry - INTEGER + Number of attempts before give up attempting to generate + valid temporary addresses. + + Default: 5 + +max_addresses - INTEGER + Maximum number of autoconfigured addresses per interface. Setting + to zero disables the limitation. It is not recommended to set this + value too large (or to zero) because it would be an easy way to + crash the kernel by allowing too many addresses to be created. + + Default: 16 + +disable_ipv6 - BOOLEAN + Disable IPv6 operation. If accept_dad is set to 2, this value + will be dynamically set to TRUE if DAD fails for the link-local + address. + + Default: FALSE (enable IPv6 operation) + + When this value is changed from 1 to 0 (IPv6 is being enabled), + it will dynamically create a link-local address on the given + interface and start Duplicate Address Detection, if necessary. + + When this value is changed from 0 to 1 (IPv6 is being disabled), + it will dynamically delete all addresses and routes on the given + interface. From now on it will not possible to add addresses/routes + to the selected interface. + +accept_dad - INTEGER + Whether to accept DAD (Duplicate Address Detection). + + == ============================================================== + 0 Disable DAD + 1 Enable DAD (default) + 2 Enable DAD, and disable IPv6 operation if MAC-based duplicate + link-local address has been found. + == ============================================================== + + DAD operation and mode on a given interface will be selected according + to the maximum value of conf/{all,interface}/accept_dad. + +force_tllao - BOOLEAN + Enable sending the target link-layer address option even when + responding to a unicast neighbor solicitation. + + Default: FALSE + + Quoting from RFC 2461, section 4.4, Target link-layer address: + + "The option MUST be included for multicast solicitations in order to + avoid infinite Neighbor Solicitation "recursion" when the peer node + does not have a cache entry to return a Neighbor Advertisements + message. When responding to unicast solicitations, the option can be + omitted since the sender of the solicitation has the correct link- + layer address; otherwise it would not have be able to send the unicast + solicitation in the first place. However, including the link-layer + address in this case adds little overhead and eliminates a potential + race condition where the sender deletes the cached link-layer address + prior to receiving a response to a previous solicitation." + +ndisc_notify - BOOLEAN + Define mode for notification of address and device changes. + + * 0 - (default): do nothing + * 1 - Generate unsolicited neighbour advertisements when device is brought + up or hardware address changes. + +ndisc_tclass - INTEGER + The IPv6 Traffic Class to use by default when sending IPv6 Neighbor + Discovery (Router Solicitation, Router Advertisement, Neighbor + Solicitation, Neighbor Advertisement, Redirect) messages. + These 8 bits can be interpreted as 6 high order bits holding the DSCP + value and 2 low order bits representing ECN (which you probably want + to leave cleared). + + * 0 - (default) + +mldv1_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv1 report retransmit will take place. + + Default: 10000 (10 seconds) + +mldv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv2 report retransmit will take place. + + Default: 1000 (1 second) + +force_mld_version - INTEGER + * 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed + * 1 - Enforce to use MLD version 1 + * 2 - Enforce to use MLD version 2 + +suppress_frag_ndisc - INTEGER + Control RFC 6980 (Security Implications of IPv6 Fragmentation + with IPv6 Neighbor Discovery) behavior: + + * 1 - (default) discard fragmented neighbor discovery packets + * 0 - allow fragmented neighbor discovery packets + +optimistic_dad - BOOLEAN + Whether to perform Optimistic Duplicate Address Detection (RFC 4429). + + * 0: disabled (default) + * 1: enabled + + Optimistic Duplicate Address Detection for the interface will be enabled + if at least one of conf/{all,interface}/optimistic_dad is set to 1, + it will be disabled otherwise. + +use_optimistic - BOOLEAN + If enabled, do not classify optimistic addresses as deprecated during + source address selection. Preferred addresses will still be chosen + before optimistic addresses, subject to other ranking in the source + address selection algorithm. + + * 0: disabled (default) + * 1: enabled + + This will be enabled if at least one of + conf/{all,interface}/use_optimistic is set to 1, disabled otherwise. + +stable_secret - IPv6 address + This IPv6 address will be used as a secret to generate IPv6 + addresses for link-local addresses and autoconfigured + ones. All addresses generated after setting this secret will + be stable privacy ones by default. This can be changed via the + addrgenmode ip-link. conf/default/stable_secret is used as the + secret for the namespace, the interface specific ones can + overwrite that. Writes to conf/all/stable_secret are refused. + + It is recommended to generate this secret during installation + of a system and keep it stable after that. + + By default the stable secret is unset. + +addr_gen_mode - INTEGER + Defines how link-local and autoconf addresses are generated. + + = ================================================================= + 0 generate address based on EUI64 (default) + 1 do no generate a link-local address, use EUI64 for addresses + generated from autoconf + 2 generate stable privacy addresses, using the secret from + stable_secret (RFC7217) + 3 generate stable privacy addresses, using a random secret if unset + = ================================================================= + +drop_unicast_in_l2_multicast - BOOLEAN + Drop any unicast IPv6 packets that are received in link-layer + multicast (or broadcast) frames. + + By default this is turned off. + +drop_unsolicited_na - BOOLEAN + Drop all unsolicited neighbor advertisements, for example if there's + a known good NA proxy on the network and such frames need not be used + (or in the case of 802.11, must not be used to prevent attacks.) + + By default this is turned off. + +enhanced_dad - BOOLEAN + Include a nonce option in the IPv6 neighbor solicitation messages used for + duplicate address detection per RFC7527. A received DAD NS will only signal + a duplicate address if the nonce is different. This avoids any false + detection of duplicates due to loopback of the NS messages that we send. + The nonce option will be sent on an interface unless both of + conf/{all,interface}/enhanced_dad are set to FALSE. + + Default: TRUE + +``icmp/*``: +=========== + +ratelimit - INTEGER + Limit the maximal rates for sending ICMPv6 messages. + + 0 to disable any limiting, + otherwise the minimal space between responses in milliseconds. + + Default: 1000 + +ratemask - list of comma separated ranges + For ICMPv6 message types matching the ranges in the ratemask, limit + the sending of the message according to ratelimit parameter. + + The format used for both input and output is a comma separated + list of ranges (e.g. "0-127,129" for ICMPv6 message type 0 to 127 and + 129). Writing to the file will clear all previous ranges of ICMPv6 + message types and update the current list with the input. + + Refer to: https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xhtml + for numerical values of ICMPv6 message types, e.g. echo request is 128 + and echo reply is 129. + + Default: 0-1,3-127 (rate limit ICMPv6 errors except Packet Too Big) + +echo_ignore_all - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it over the IPv6 protocol. + + Default: 0 + +echo_ignore_multicast - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it over the IPv6 protocol via multicast. + + Default: 0 + +echo_ignore_anycast - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it over the IPv6 protocol destined to anycast address. + + Default: 0 + +xfrm6_gc_thresh - INTEGER + (Obsolete since linux-4.14) + The threshold at which we will start garbage collecting for IPv6 + destination cache entries. At twice this value the system will + refuse new allocations. + + +IPv6 Update by: +Pekka Savola +YOSHIFUJI Hideaki / USAGI Project + + +/proc/sys/net/bridge/* Variables: +================================= + +bridge-nf-call-arptables - BOOLEAN + - 1 : pass bridged ARP traffic to arptables' FORWARD chain. + - 0 : disable this. + + Default: 1 + +bridge-nf-call-iptables - BOOLEAN + - 1 : pass bridged IPv4 traffic to iptables' chains. + - 0 : disable this. + + Default: 1 + +bridge-nf-call-ip6tables - BOOLEAN + - 1 : pass bridged IPv6 traffic to ip6tables' chains. + - 0 : disable this. + + Default: 1 + +bridge-nf-filter-vlan-tagged - BOOLEAN + - 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables. + - 0 : disable this. + + Default: 0 + +bridge-nf-filter-pppoe-tagged - BOOLEAN + - 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables. + - 0 : disable this. + + Default: 0 + +bridge-nf-pass-vlan-input-dev - BOOLEAN + - 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan + interface on the bridge and set the netfilter input device to the + vlan. This allows use of e.g. "iptables -i br0.1" and makes the + REDIRECT target work with vlan-on-top-of-bridge interfaces. When no + matching vlan interface is found, or this switch is off, the input + device is set to the bridge interface. + + - 0: disable bridge netfilter vlan interface lookup. + + Default: 0 + +``proc/sys/net/sctp/*`` Variables: +================================== + +addip_enable - BOOLEAN + Enable or disable extension of Dynamic Address Reconfiguration + (ADD-IP) functionality specified in RFC5061. This extension provides + the ability to dynamically add and remove new addresses for the SCTP + associations. + + 1: Enable extension. + + 0: Disable extension. + + Default: 0 + +pf_enable - INTEGER + Enable or disable pf (pf is short for potentially failed) state. A value + of pf_retrans > path_max_retrans also disables pf state. That is, one of + both pf_enable and pf_retrans > path_max_retrans can disable pf state. + Since pf_retrans and path_max_retrans can be changed by userspace + application, sometimes user expects to disable pf state by the value of + pf_retrans > path_max_retrans, but occasionally the value of pf_retrans + or path_max_retrans is changed by the user application, this pf state is + enabled. As such, it is necessary to add this to dynamically enable + and disable pf state. See: + https://datatracker.ietf.org/doc/draft-ietf-tsvwg-sctp-failover for + details. + + 1: Enable pf. + + 0: Disable pf. + + Default: 1 + +pf_expose - INTEGER + Unset or enable/disable pf (pf is short for potentially failed) state + exposure. Applications can control the exposure of the PF path state + in the SCTP_PEER_ADDR_CHANGE event and the SCTP_GET_PEER_ADDR_INFO + sockopt. When it's unset, no SCTP_PEER_ADDR_CHANGE event with + SCTP_ADDR_PF state will be sent and a SCTP_PF-state transport info + can be got via SCTP_GET_PEER_ADDR_INFO sockopt; When it's enabled, + a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming + SCTP_PF state and a SCTP_PF-state transport info can be got via + SCTP_GET_PEER_ADDR_INFO sockopt; When it's diabled, no + SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when + trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO + sockopt. + + 0: Unset pf state exposure, Compatible with old applications. + + 1: Disable pf state exposure. + + 2: Enable pf state exposure. + + Default: 0 + +addip_noauth_enable - BOOLEAN + Dynamic Address Reconfiguration (ADD-IP) requires the use of + authentication to protect the operations of adding or removing new + addresses. This requirement is mandated so that unauthorized hosts + would not be able to hijack associations. However, older + implementations may not have implemented this requirement while + allowing the ADD-IP extension. For reasons of interoperability, + we provide this variable to control the enforcement of the + authentication requirement. + + == =============================================================== + 1 Allow ADD-IP extension to be used without authentication. This + should only be set in a closed environment for interoperability + with older implementations. + + 0 Enforce the authentication requirement + == =============================================================== + + Default: 0 + +auth_enable - BOOLEAN + Enable or disable Authenticated Chunks extension. This extension + provides the ability to send and receive authenticated chunks and is + required for secure operation of Dynamic Address Reconfiguration + (ADD-IP) extension. + + - 1: Enable this extension. + - 0: Disable this extension. + + Default: 0 + +prsctp_enable - BOOLEAN + Enable or disable the Partial Reliability extension (RFC3758) which + is used to notify peers that a given DATA should no longer be expected. + + - 1: Enable extension + - 0: Disable + + Default: 1 + +max_burst - INTEGER + The limit of the number of new packets that can be initially sent. It + controls how bursty the generated traffic can be. + + Default: 4 + +association_max_retrans - INTEGER + Set the maximum number for retransmissions that an association can + attempt deciding that the remote end is unreachable. If this value + is exceeded, the association is terminated. + + Default: 10 + +max_init_retransmits - INTEGER + The maximum number of retransmissions of INIT and COOKIE-ECHO chunks + that an association will attempt before declaring the destination + unreachable and terminating. + + Default: 8 + +path_max_retrans - INTEGER + The maximum number of retransmissions that will be attempted on a given + path. Once this threshold is exceeded, the path is considered + unreachable, and new traffic will use a different path when the + association is multihomed. + + Default: 5 + +pf_retrans - INTEGER + The number of retransmissions that will be attempted on a given path + before traffic is redirected to an alternate transport (should one + exist). Note this is distinct from path_max_retrans, as a path that + passes the pf_retrans threshold can still be used. Its only + deprioritized when a transmission path is selected by the stack. This + setting is primarily used to enable fast failover mechanisms without + having to reduce path_max_retrans to a very low value. See: + http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt + for details. Note also that a value of pf_retrans > path_max_retrans + disables this feature. Since both pf_retrans and path_max_retrans can + be changed by userspace application, a variable pf_enable is used to + disable pf state. + + Default: 0 + +ps_retrans - INTEGER + Primary.Switchover.Max.Retrans (PSMR), it's a tunable parameter coming + from section-5 "Primary Path Switchover" in rfc7829. The primary path + will be changed to another active path when the path error counter on + the old primary path exceeds PSMR, so that "the SCTP sender is allowed + to continue data transmission on a new working path even when the old + primary destination address becomes active again". Note this feature + is disabled by initializing 'ps_retrans' per netns as 0xffff by default, + and its value can't be less than 'pf_retrans' when changing by sysctl. + + Default: 0xffff + +rto_initial - INTEGER + The initial round trip timeout value in milliseconds that will be used + in calculating round trip times. This is the initial time interval + for retransmissions. + + Default: 3000 + +rto_max - INTEGER + The maximum value (in milliseconds) of the round trip timeout. This + is the largest time interval that can elapse between retransmissions. + + Default: 60000 + +rto_min - INTEGER + The minimum value (in milliseconds) of the round trip timeout. This + is the smallest time interval the can elapse between retransmissions. + + Default: 1000 + +hb_interval - INTEGER + The interval (in milliseconds) between HEARTBEAT chunks. These chunks + are sent at the specified interval on idle paths to probe the state of + a given path between 2 associations. + + Default: 30000 + +sack_timeout - INTEGER + The amount of time (in milliseconds) that the implementation will wait + to send a SACK. + + Default: 200 + +valid_cookie_life - INTEGER + The default lifetime of the SCTP cookie (in milliseconds). The cookie + is used during association establishment. + + Default: 60000 + +cookie_preserve_enable - BOOLEAN + Enable or disable the ability to extend the lifetime of the SCTP cookie + that is used during the establishment phase of SCTP association + + - 1: Enable cookie lifetime extension. + - 0: Disable + + Default: 1 + +cookie_hmac_alg - STRING + Select the hmac algorithm used when generating the cookie value sent by + a listening sctp socket to a connecting client in the INIT-ACK chunk. + Valid values are: + + * md5 + * sha1 + * none + + Ability to assign md5 or sha1 as the selected alg is predicated on the + configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and + CONFIG_CRYPTO_SHA1). + + Default: Dependent on configuration. MD5 if available, else SHA1 if + available, else none. + +rcvbuf_policy - INTEGER + Determines if the receive buffer is attributed to the socket or to + association. SCTP supports the capability to create multiple + associations on a single socket. When using this capability, it is + possible that a single stalled association that's buffering a lot + of data may block other associations from delivering their data by + consuming all of the receive buffer space. To work around this, + the rcvbuf_policy could be set to attribute the receiver buffer space + to each association instead of the socket. This prevents the described + blocking. + + - 1: rcvbuf space is per association + - 0: rcvbuf space is per socket + + Default: 0 + +sndbuf_policy - INTEGER + Similar to rcvbuf_policy above, this applies to send buffer space. + + - 1: Send buffer is tracked per association + - 0: Send buffer is tracked per socket. + + Default: 0 + +sctp_mem - vector of 3 INTEGERs: min, pressure, max + Number of pages allowed for queueing by all SCTP sockets. + + min: Below this number of pages SCTP is not bothered about its + memory appetite. When amount of memory allocated by SCTP exceeds + this number, SCTP starts to moderate memory usage. + + pressure: This value was introduced to follow format of tcp_mem. + + max: Number of pages allowed for queueing by all SCTP sockets. + + Default is calculated at boot time from amount of available memory. + +sctp_rmem - vector of 3 INTEGERs: min, default, max + Only the first value ("min") is used, "default" and "max" are + ignored. + + min: Minimal size of receive buffer used by SCTP socket. + It is guaranteed to each SCTP socket (but not association) even + under moderate memory pressure. + + Default: 4K + +sctp_wmem - vector of 3 INTEGERs: min, default, max + Currently this tunable has no effect. + +addr_scope_policy - INTEGER + Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 + + - 0 - Disable IPv4 address scoping + - 1 - Enable IPv4 address scoping + - 2 - Follow draft but allow IPv4 private addresses + - 3 - Follow draft but allow IPv4 link local addresses + + Default: 1 + + +``/proc/sys/net/core/*`` +======================== + + Please see: Documentation/admin-guide/sysctl/net.rst for descriptions of these entries. + + +``/proc/sys/net/unix/*`` +======================== + +max_dgram_qlen - INTEGER + The maximum length of dgram socket receive queue + + Default: 10 + diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt deleted file mode 100644 index 5cdc37c34830..000000000000 --- a/Documentation/networking/ip-sysctl.txt +++ /dev/null @@ -1,2374 +0,0 @@ -/proc/sys/net/ipv4/* Variables: - -ip_forward - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - Forward Packets between interfaces. - - This variable is special, its change resets all configuration - parameters to their default state (RFC1122 for hosts, RFC1812 - for routers) - -ip_default_ttl - INTEGER - Default value of TTL field (Time To Live) for outgoing (but not - forwarded) IP packets. Should be between 1 and 255 inclusive. - Default: 64 (as recommended by RFC1700) - -ip_no_pmtu_disc - INTEGER - Disable Path MTU Discovery. If enabled in mode 1 and a - fragmentation-required ICMP is received, the PMTU to this - destination will be set to min_pmtu (see below). You will need - to raise min_pmtu to the smallest interface MTU on your system - manually if you want to avoid locally generated fragments. - - In mode 2 incoming Path MTU Discovery messages will be - discarded. Outgoing frames are handled the same as in mode 1, - implicitly setting IP_PMTUDISC_DONT on every created socket. - - Mode 3 is a hardened pmtu discover mode. The kernel will only - accept fragmentation-needed errors if the underlying protocol - can verify them besides a plain socket lookup. Current - protocols for which pmtu events will be honored are TCP, SCTP - and DCCP as they verify e.g. the sequence number or the - association. This mode should not be enabled globally but is - only intended to secure e.g. name servers in namespaces where - TCP path mtu must still work but path MTU information of other - protocols should be discarded. If enabled globally this mode - could break other protocols. - - Possible values: 0-3 - Default: FALSE - -min_pmtu - INTEGER - default 552 - minimum discovered Path MTU - -ip_forward_use_pmtu - BOOLEAN - By default we don't trust protocol path MTUs while forwarding - because they could be easily forged and can lead to unwanted - fragmentation by the router. - You only need to enable this if you have user-space software - which tries to discover path mtus by itself and depends on the - kernel honoring this information. This is normally not the - case. - Default: 0 (disabled) - Possible values: - 0 - disabled - 1 - enabled - -fwmark_reflect - BOOLEAN - Controls the fwmark of kernel-generated IPv4 reply packets that are not - associated with a socket for example, TCP RSTs or ICMP echo replies). - If unset, these packets have a fwmark of zero. If set, they have the - fwmark of the packet they are replying to. - Default: 0 - -fib_multipath_use_neigh - BOOLEAN - Use status of existing neighbor entry when determining nexthop for - multipath routes. If disabled, neighbor information is not used and - packets could be directed to a failed nexthop. Only valid for kernels - built with CONFIG_IP_ROUTE_MULTIPATH enabled. - Default: 0 (disabled) - Possible values: - 0 - disabled - 1 - enabled - -fib_multipath_hash_policy - INTEGER - Controls which hash policy to use for multipath routes. Only valid - for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled. - Default: 0 (Layer 3) - Possible values: - 0 - Layer 3 - 1 - Layer 4 - 2 - Layer 3 or inner Layer 3 if present - -fib_sync_mem - UNSIGNED INTEGER - Amount of dirty memory from fib entries that can be backlogged before - synchronize_rcu is forced. - Default: 512kB Minimum: 64kB Maximum: 64MB - -ip_forward_update_priority - INTEGER - Whether to update SKB priority from "TOS" field in IPv4 header after it - is forwarded. The new SKB priority is mapped from TOS field value - according to an rt_tos2priority table (see e.g. man tc-prio). - Default: 1 (Update priority.) - Possible values: - 0 - Do not update priority. - 1 - Update priority. - -route/max_size - INTEGER - Maximum number of routes allowed in the kernel. Increase - this when using large numbers of interfaces and/or routes. - From linux kernel 3.6 onwards, this is deprecated for ipv4 - as route cache is no longer used. - -neigh/default/gc_thresh1 - INTEGER - Minimum number of entries to keep. Garbage collector will not - purge entries if there are fewer than this number. - Default: 128 - -neigh/default/gc_thresh2 - INTEGER - Threshold when garbage collector becomes more aggressive about - purging entries. Entries older than 5 seconds will be cleared - when over this number. - Default: 512 - -neigh/default/gc_thresh3 - INTEGER - Maximum number of non-PERMANENT neighbor entries allowed. Increase - this when using large numbers of interfaces and when communicating - with large numbers of directly-connected peers. - Default: 1024 - -neigh/default/unres_qlen_bytes - INTEGER - The maximum number of bytes which may be used by packets - queued for each unresolved address by other network layers. - (added in linux 3.3) - Setting negative value is meaningless and will return error. - Default: SK_WMEM_MAX, (same as net.core.wmem_default). - Exact value depends on architecture and kernel options, - but should be enough to allow queuing 256 packets - of medium size. - -neigh/default/unres_qlen - INTEGER - The maximum number of packets which may be queued for each - unresolved address by other network layers. - (deprecated in linux 3.3) : use unres_qlen_bytes instead. - Prior to linux 3.3, the default value is 3 which may cause - unexpected packet loss. The current default value is calculated - according to default value of unres_qlen_bytes and true size of - packet. - Default: 101 - -mtu_expires - INTEGER - Time, in seconds, that cached PMTU information is kept. - -min_adv_mss - INTEGER - The advertised MSS depends on the first hop route MTU, but will - never be lower than this setting. - -IP Fragmentation: - -ipfrag_high_thresh - LONG INTEGER - Maximum memory used to reassemble IP fragments. - -ipfrag_low_thresh - LONG INTEGER - (Obsolete since linux-4.17) - Maximum memory used to reassemble IP fragments before the kernel - begins to remove incomplete fragment queues to free up resources. - The kernel still accepts new fragments for defragmentation. - -ipfrag_time - INTEGER - Time in seconds to keep an IP fragment in memory. - -ipfrag_max_dist - INTEGER - ipfrag_max_dist is a non-negative integer value which defines the - maximum "disorder" which is allowed among fragments which share a - common IP source address. Note that reordering of packets is - not unusual, but if a large number of fragments arrive from a source - IP address while a particular fragment queue remains incomplete, it - probably indicates that one or more fragments belonging to that queue - have been lost. When ipfrag_max_dist is positive, an additional check - is done on fragments before they are added to a reassembly queue - if - ipfrag_max_dist (or more) fragments have arrived from a particular IP - address between additions to any IP fragment queue using that source - address, it's presumed that one or more fragments in the queue are - lost. The existing fragment queue will be dropped, and a new one - started. An ipfrag_max_dist value of zero disables this check. - - Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can - result in unnecessarily dropping fragment queues when normal - reordering of packets occurs, which could lead to poor application - performance. Using a very large value, e.g. 50000, increases the - likelihood of incorrectly reassembling IP fragments that originate - from different IP datagrams, which could result in data corruption. - Default: 64 - -INET peer storage: - -inet_peer_threshold - INTEGER - The approximate size of the storage. Starting from this threshold - entries will be thrown aggressively. This threshold also determines - entries' time-to-live and time intervals between garbage collection - passes. More entries, less time-to-live, less GC interval. - -inet_peer_minttl - INTEGER - Minimum time-to-live of entries. Should be enough to cover fragment - time-to-live on the reassembling side. This minimum time-to-live is - guaranteed if the pool size is less than inet_peer_threshold. - Measured in seconds. - -inet_peer_maxttl - INTEGER - Maximum time-to-live of entries. Unused entries will expire after - this period of time if there is no memory pressure on the pool (i.e. - when the number of entries in the pool is very small). - Measured in seconds. - -TCP variables: - -somaxconn - INTEGER - Limit of socket listen() backlog, known in userspace as SOMAXCONN. - Defaults to 4096. (Was 128 before linux-5.4) - See also tcp_max_syn_backlog for additional tuning for TCP sockets. - -tcp_abort_on_overflow - BOOLEAN - If listening service is too slow to accept new connections, - reset them. Default state is FALSE. It means that if overflow - occurred due to a burst, connection will recover. Enable this - option _only_ if you are really sure that listening daemon - cannot be tuned to accept connections faster. Enabling this - option can harm clients of your server. - -tcp_adv_win_scale - INTEGER - Count buffering overhead as bytes/2^tcp_adv_win_scale - (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), - if it is <= 0. - Possible values are [-31, 31], inclusive. - Default: 1 - -tcp_allowed_congestion_control - STRING - Show/set the congestion control choices available to non-privileged - processes. The list is a subset of those listed in - tcp_available_congestion_control. - Default is "reno" and the default setting (tcp_congestion_control). - -tcp_app_win - INTEGER - Reserve max(window/2^tcp_app_win, mss) of window for application - buffer. Value 0 is special, it means that nothing is reserved. - Default: 31 - -tcp_autocorking - BOOLEAN - Enable TCP auto corking : - When applications do consecutive small write()/sendmsg() system calls, - we try to coalesce these small writes as much as possible, to lower - total amount of sent packets. This is done if at least one prior - packet for the flow is waiting in Qdisc queues or device transmit - queue. Applications can still use TCP_CORK for optimal behavior - when they know how/when to uncork their sockets. - Default : 1 - -tcp_available_congestion_control - STRING - Shows the available congestion control choices that are registered. - More congestion control algorithms may be available as modules, - but not loaded. - -tcp_base_mss - INTEGER - The initial value of search_low to be used by the packetization layer - Path MTU discovery (MTU probing). If MTU probing is enabled, - this is the initial MSS used by the connection. - -tcp_mtu_probe_floor - INTEGER - If MTU probing is enabled this caps the minimum MSS used for search_low - for the connection. - - Default : 48 - -tcp_min_snd_mss - INTEGER - TCP SYN and SYNACK messages usually advertise an ADVMSS option, - as described in RFC 1122 and RFC 6691. - If this ADVMSS option is smaller than tcp_min_snd_mss, - it is silently capped to tcp_min_snd_mss. - - Default : 48 (at least 8 bytes of payload per segment) - -tcp_congestion_control - STRING - Set the congestion control algorithm to be used for new - connections. The algorithm "reno" is always available, but - additional choices may be available based on kernel configuration. - Default is set as part of kernel configuration. - For passive connections, the listener congestion control choice - is inherited. - [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ] - -tcp_dsack - BOOLEAN - Allows TCP to send "duplicate" SACKs. - -tcp_early_retrans - INTEGER - Tail loss probe (TLP) converts RTOs occurring due to tail - losses into fast recovery (draft-ietf-tcpm-rack). Note that - TLP requires RACK to function properly (see tcp_recovery below) - Possible values: - 0 disables TLP - 3 or 4 enables TLP - Default: 3 - -tcp_ecn - INTEGER - Control use of Explicit Congestion Notification (ECN) by TCP. - ECN is used only when both ends of the TCP connection indicate - support for it. This feature is useful in avoiding losses due - to congestion by allowing supporting routers to signal - congestion before having to drop packets. - Possible values are: - 0 Disable ECN. Neither initiate nor accept ECN. - 1 Enable ECN when requested by incoming connections and - also request ECN on outgoing connection attempts. - 2 Enable ECN when requested by incoming connections - but do not request ECN on outgoing connections. - Default: 2 - -tcp_ecn_fallback - BOOLEAN - If the kernel detects that ECN connection misbehaves, enable fall - back to non-ECN. Currently, this knob implements the fallback - from RFC3168, section 6.1.1.1., but we reserve that in future, - additional detection mechanisms could be implemented under this - knob. The value is not used, if tcp_ecn or per route (or congestion - control) ECN settings are disabled. - Default: 1 (fallback enabled) - -tcp_fack - BOOLEAN - This is a legacy option, it has no effect anymore. - -tcp_fin_timeout - INTEGER - The length of time an orphaned (no longer referenced by any - application) connection will remain in the FIN_WAIT_2 state - before it is aborted at the local end. While a perfectly - valid "receive only" state for an un-orphaned connection, an - orphaned connection in FIN_WAIT_2 state could otherwise wait - forever for the remote to close its end of the connection. - Cf. tcp_max_orphans - Default: 60 seconds - -tcp_frto - INTEGER - Enables Forward RTO-Recovery (F-RTO) defined in RFC5682. - F-RTO is an enhanced recovery algorithm for TCP retransmission - timeouts. It is particularly beneficial in networks where the - RTT fluctuates (e.g., wireless). F-RTO is sender-side only - modification. It does not require any support from the peer. - - By default it's enabled with a non-zero value. 0 disables F-RTO. - -tcp_fwmark_accept - BOOLEAN - If set, incoming connections to listening sockets that do not have a - socket mark will set the mark of the accepting socket to the fwmark of - the incoming SYN packet. This will cause all packets on that connection - (starting from the first SYNACK) to be sent with that fwmark. The - listening socket's mark is unchanged. Listening sockets that already - have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are - unaffected. - - Default: 0 - -tcp_invalid_ratelimit - INTEGER - Limit the maximal rate for sending duplicate acknowledgments - in response to incoming TCP packets that are for an existing - connection but that are invalid due to any of these reasons: - - (a) out-of-window sequence number, - (b) out-of-window acknowledgment number, or - (c) PAWS (Protection Against Wrapped Sequence numbers) check failure - - This can help mitigate simple "ack loop" DoS attacks, wherein - a buggy or malicious middlebox or man-in-the-middle can - rewrite TCP header fields in manner that causes each endpoint - to think that the other is sending invalid TCP segments, thus - causing each side to send an unterminating stream of duplicate - acknowledgments for invalid segments. - - Using 0 disables rate-limiting of dupacks in response to - invalid segments; otherwise this value specifies the minimal - space between sending such dupacks, in milliseconds. - - Default: 500 (milliseconds). - -tcp_keepalive_time - INTEGER - How often TCP sends out keepalive messages when keepalive is enabled. - Default: 2hours. - -tcp_keepalive_probes - INTEGER - How many keepalive probes TCP sends out, until it decides that the - connection is broken. Default value: 9. - -tcp_keepalive_intvl - INTEGER - How frequently the probes are send out. Multiplied by - tcp_keepalive_probes it is time to kill not responding connection, - after probes started. Default value: 75sec i.e. connection - will be aborted after ~11 minutes of retries. - -tcp_l3mdev_accept - BOOLEAN - Enables child sockets to inherit the L3 master device index. - Enabling this option allows a "global" listen socket to work - across L3 master domains (e.g., VRFs) with connected sockets - derived from the listen socket to be bound to the L3 domain in - which the packets originated. Only valid when the kernel was - compiled with CONFIG_NET_L3_MASTER_DEV. - Default: 0 (disabled) - -tcp_low_latency - BOOLEAN - This is a legacy option, it has no effect anymore. - -tcp_max_orphans - INTEGER - Maximal number of TCP sockets not attached to any user file handle, - held by system. If this number is exceeded orphaned connections are - reset immediately and warning is printed. This limit exists - only to prevent simple DoS attacks, you _must_ not rely on this - or lower the limit artificially, but rather increase it - (probably, after increasing installed memory), - if network conditions require more than default value, - and tune network services to linger and kill such states - more aggressively. Let me to remind again: each orphan eats - up to ~64K of unswappable memory. - -tcp_max_syn_backlog - INTEGER - Maximal number of remembered connection requests (SYN_RECV), - which have not received an acknowledgment from connecting client. - This is a per-listener limit. - The minimal value is 128 for low memory machines, and it will - increase in proportion to the memory of machine. - If server suffers from overload, try increasing this number. - Remember to also check /proc/sys/net/core/somaxconn - A SYN_RECV request socket consumes about 304 bytes of memory. - -tcp_max_tw_buckets - INTEGER - Maximal number of timewait sockets held by system simultaneously. - If this number is exceeded time-wait socket is immediately destroyed - and warning is printed. This limit exists only to prevent - simple DoS attacks, you _must_ not lower the limit artificially, - but rather increase it (probably, after increasing installed memory), - if network conditions require more than default value. - -tcp_mem - vector of 3 INTEGERs: min, pressure, max - min: below this number of pages TCP is not bothered about its - memory appetite. - - pressure: when amount of memory allocated by TCP exceeds this number - of pages, TCP moderates its memory consumption and enters memory - pressure mode, which is exited when memory consumption falls - under "min". - - max: number of pages allowed for queueing by all TCP sockets. - - Defaults are calculated at boot time from amount of available - memory. - -tcp_min_rtt_wlen - INTEGER - The window length of the windowed min filter to track the minimum RTT. - A shorter window lets a flow more quickly pick up new (higher) - minimum RTT when it is moved to a longer path (e.g., due to traffic - engineering). A longer window makes the filter more resistant to RTT - inflations such as transient congestion. The unit is seconds. - Possible values: 0 - 86400 (1 day) - Default: 300 - -tcp_moderate_rcvbuf - BOOLEAN - If set, TCP performs receive buffer auto-tuning, attempting to - automatically size the buffer (no greater than tcp_rmem[2]) to - match the size required by the path for full throughput. Enabled by - default. - -tcp_mtu_probing - INTEGER - Controls TCP Packetization-Layer Path MTU Discovery. Takes three - values: - 0 - Disabled - 1 - Disabled by default, enabled when an ICMP black hole detected - 2 - Always enabled, use initial MSS of tcp_base_mss. - -tcp_probe_interval - UNSIGNED INTEGER - Controls how often to start TCP Packetization-Layer Path MTU - Discovery reprobe. The default is reprobing every 10 minutes as - per RFC4821. - -tcp_probe_threshold - INTEGER - Controls when TCP Packetization-Layer Path MTU Discovery probing - will stop in respect to the width of search range in bytes. Default - is 8 bytes. - -tcp_no_metrics_save - BOOLEAN - By default, TCP saves various connection metrics in the route cache - when the connection closes, so that connections established in the - near future can use these to set initial conditions. Usually, this - increases overall performance, but may sometimes cause performance - degradation. If set, TCP will not cache metrics on closing - connections. - -tcp_no_ssthresh_metrics_save - BOOLEAN - Controls whether TCP saves ssthresh metrics in the route cache. - Default is 1, which disables ssthresh metrics. - -tcp_orphan_retries - INTEGER - This value influences the timeout of a locally closed TCP connection, - when RTO retransmissions remain unacknowledged. - See tcp_retries2 for more details. - - The default value is 8. - If your machine is a loaded WEB server, - you should think about lowering this value, such sockets - may consume significant resources. Cf. tcp_max_orphans. - -tcp_recovery - INTEGER - This value is a bitmap to enable various experimental loss recovery - features. - - RACK: 0x1 enables the RACK loss detection for fast detection of lost - retransmissions and tail drops. It also subsumes and disables - RFC6675 recovery for SACK connections. - RACK: 0x2 makes RACK's reordering window static (min_rtt/4). - RACK: 0x4 disables RACK's DUPACK threshold heuristic - - Default: 0x1 - -tcp_reordering - INTEGER - Initial reordering level of packets in a TCP stream. - TCP stack can then dynamically adjust flow reordering level - between this initial value and tcp_max_reordering - Default: 3 - -tcp_max_reordering - INTEGER - Maximal reordering level of packets in a TCP stream. - 300 is a fairly conservative value, but you might increase it - if paths are using per packet load balancing (like bonding rr mode) - Default: 300 - -tcp_retrans_collapse - BOOLEAN - Bug-to-bug compatibility with some broken printers. - On retransmit try to send bigger packets to work around bugs in - certain TCP stacks. - -tcp_retries1 - INTEGER - This value influences the time, after which TCP decides, that - something is wrong due to unacknowledged RTO retransmissions, - and reports this suspicion to the network layer. - See tcp_retries2 for more details. - - RFC 1122 recommends at least 3 retransmissions, which is the - default. - -tcp_retries2 - INTEGER - This value influences the timeout of an alive TCP connection, - when RTO retransmissions remain unacknowledged. - Given a value of N, a hypothetical TCP connection following - exponential backoff with an initial RTO of TCP_RTO_MIN would - retransmit N times before killing the connection at the (N+1)th RTO. - - The default value of 15 yields a hypothetical timeout of 924.6 - seconds and is a lower bound for the effective timeout. - TCP will effectively time out at the first RTO which exceeds the - hypothetical timeout. - - RFC 1122 recommends at least 100 seconds for the timeout, - which corresponds to a value of at least 8. - -tcp_rfc1337 - BOOLEAN - If set, the TCP stack behaves conforming to RFC1337. If unset, - we are not conforming to RFC, but prevent TCP TIME_WAIT - assassination. - Default: 0 - -tcp_rmem - vector of 3 INTEGERs: min, default, max - min: Minimal size of receive buffer used by TCP sockets. - It is guaranteed to each TCP socket, even under moderate memory - pressure. - Default: 4K - - default: initial size of receive buffer used by TCP sockets. - This value overrides net.core.rmem_default used by other protocols. - Default: 87380 bytes. This value results in window of 65535 with - default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit - less for default tcp_app_win. See below about these variables. - - max: maximal size of receive buffer allowed for automatically - selected receiver buffers for TCP socket. This value does not override - net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables - automatic tuning of that socket's receive buffer size, in which - case this value is ignored. - Default: between 87380B and 6MB, depending on RAM size. - -tcp_sack - BOOLEAN - Enable select acknowledgments (SACKS). - -tcp_comp_sack_delay_ns - LONG INTEGER - TCP tries to reduce number of SACK sent, using a timer - based on 5% of SRTT, capped by this sysctl, in nano seconds. - The default is 1ms, based on TSO autosizing period. - - Default : 1,000,000 ns (1 ms) - -tcp_comp_sack_nr - INTEGER - Max number of SACK that can be compressed. - Using 0 disables SACK compression. - - Default : 44 - -tcp_slow_start_after_idle - BOOLEAN - If set, provide RFC2861 behavior and time out the congestion - window after an idle period. An idle period is defined at - the current RTO. If unset, the congestion window will not - be timed out after an idle period. - Default: 1 - -tcp_stdurg - BOOLEAN - Use the Host requirements interpretation of the TCP urgent pointer field. - Most hosts use the older BSD interpretation, so if you turn this on - Linux might not communicate correctly with them. - Default: FALSE - -tcp_synack_retries - INTEGER - Number of times SYNACKs for a passive TCP connection attempt will - be retransmitted. Should not be higher than 255. Default value - is 5, which corresponds to 31seconds till the last retransmission - with the current initial RTO of 1second. With this the final timeout - for a passive TCP connection will happen after 63seconds. - -tcp_syncookies - INTEGER - Only valid when the kernel was compiled with CONFIG_SYN_COOKIES - Send out syncookies when the syn backlog queue of a socket - overflows. This is to prevent against the common 'SYN flood attack' - Default: 1 - - Note, that syncookies is fallback facility. - It MUST NOT be used to help highly loaded servers to stand - against legal connection rate. If you see SYN flood warnings - in your logs, but investigation shows that they occur - because of overload with legal connections, you should tune - another parameters until this warning disappear. - See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow. - - syncookies seriously violate TCP protocol, do not allow - to use TCP extensions, can result in serious degradation - of some services (f.e. SMTP relaying), visible not by you, - but your clients and relays, contacting you. While you see - SYN flood warnings in logs not being really flooded, your server - is seriously misconfigured. - - If you want to test which effects syncookies have to your - network connections you can set this knob to 2 to enable - unconditionally generation of syncookies. - -tcp_fastopen - INTEGER - Enable TCP Fast Open (RFC7413) to send and accept data in the opening - SYN packet. - - The client support is enabled by flag 0x1 (on by default). The client - then must use sendmsg() or sendto() with the MSG_FASTOPEN flag, - rather than connect() to send data in SYN. - - The server support is enabled by flag 0x2 (off by default). Then - either enable for all listeners with another flag (0x400) or - enable individual listeners via TCP_FASTOPEN socket option with - the option value being the length of the syn-data backlog. - - The values (bitmap) are - 0x1: (client) enables sending data in the opening SYN on the client. - 0x2: (server) enables the server support, i.e., allowing data in - a SYN packet to be accepted and passed to the - application before 3-way handshake finishes. - 0x4: (client) send data in the opening SYN regardless of cookie - availability and without a cookie option. - 0x200: (server) accept data-in-SYN w/o any cookie option present. - 0x400: (server) enable all listeners to support Fast Open by - default without explicit TCP_FASTOPEN socket option. - - Default: 0x1 - - Note that that additional client or server features are only - effective if the basic support (0x1 and 0x2) are enabled respectively. - -tcp_fastopen_blackhole_timeout_sec - INTEGER - Initial time period in second to disable Fastopen on active TCP sockets - when a TFO firewall blackhole issue happens. - This time period will grow exponentially when more blackhole issues - get detected right after Fastopen is re-enabled and will reset to - initial value when the blackhole issue goes away. - 0 to disable the blackhole detection. - By default, it is set to 1hr. - -tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs - The list consists of a primary key and an optional backup key. The - primary key is used for both creating and validating cookies, while the - optional backup key is only used for validating cookies. The purpose of - the backup key is to maximize TFO validation when keys are rotated. - - A randomly chosen primary key may be configured by the kernel if - the tcp_fastopen sysctl is set to 0x400 (see above), or if the - TCP_FASTOPEN setsockopt() optname is set and a key has not been - previously configured via sysctl. If keys are configured via - setsockopt() by using the TCP_FASTOPEN_KEY optname, then those - per-socket keys will be used instead of any keys that are specified via - sysctl. - - A key is specified as 4 8-digit hexadecimal integers which are separated - by a '-' as: xxxxxxxx-xxxxxxxx-xxxxxxxx-xxxxxxxx. Leading zeros may be - omitted. A primary and a backup key may be specified by separating them - by a comma. If only one key is specified, it becomes the primary key and - any previously configured backup keys are removed. - -tcp_syn_retries - INTEGER - Number of times initial SYNs for an active TCP connection attempt - will be retransmitted. Should not be higher than 127. Default value - is 6, which corresponds to 63seconds till the last retransmission - with the current initial RTO of 1second. With this the final timeout - for an active TCP connection attempt will happen after 127seconds. - -tcp_timestamps - INTEGER -Enable timestamps as defined in RFC1323. - 0: Disabled. - 1: Enable timestamps as defined in RFC1323 and use random offset for - each connection rather than only using the current time. - 2: Like 1, but without random offsets. - Default: 1 - -tcp_min_tso_segs - INTEGER - Minimal number of segments per TSO frame. - Since linux-3.12, TCP does an automatic sizing of TSO frames, - depending on flow rate, instead of filling 64Kbytes packets. - For specific usages, it's possible to force TCP to build big - TSO frames. Note that TCP stack might split too big TSO packets - if available window is too small. - Default: 2 - -tcp_pacing_ss_ratio - INTEGER - sk->sk_pacing_rate is set by TCP stack using a ratio applied - to current rate. (current_rate = cwnd * mss / srtt) - If TCP is in slow start, tcp_pacing_ss_ratio is applied - to let TCP probe for bigger speeds, assuming cwnd can be - doubled every other RTT. - Default: 200 - -tcp_pacing_ca_ratio - INTEGER - sk->sk_pacing_rate is set by TCP stack using a ratio applied - to current rate. (current_rate = cwnd * mss / srtt) - If TCP is in congestion avoidance phase, tcp_pacing_ca_ratio - is applied to conservatively probe for bigger throughput. - Default: 120 - -tcp_tso_win_divisor - INTEGER - This allows control over what percentage of the congestion window - can be consumed by a single TSO frame. - The setting of this parameter is a choice between burstiness and - building larger TSO frames. - Default: 3 - -tcp_tw_reuse - INTEGER - Enable reuse of TIME-WAIT sockets for new connections when it is - safe from protocol viewpoint. - 0 - disable - 1 - global enable - 2 - enable for loopback traffic only - It should not be changed without advice/request of technical - experts. - Default: 2 - -tcp_window_scaling - BOOLEAN - Enable window scaling as defined in RFC1323. - -tcp_wmem - vector of 3 INTEGERs: min, default, max - min: Amount of memory reserved for send buffers for TCP sockets. - Each TCP socket has rights to use it due to fact of its birth. - Default: 4K - - default: initial size of send buffer used by TCP sockets. This - value overrides net.core.wmem_default used by other protocols. - It is usually lower than net.core.wmem_default. - Default: 16K - - max: Maximal amount of memory allowed for automatically tuned - send buffers for TCP sockets. This value does not override - net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables - automatic tuning of that socket's send buffer size, in which case - this value is ignored. - Default: between 64K and 4MB, depending on RAM size. - -tcp_notsent_lowat - UNSIGNED INTEGER - A TCP socket can control the amount of unsent bytes in its write queue, - thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() - reports POLLOUT events if the amount of unsent bytes is below a per - socket value, and if the write queue is not full. sendmsg() will - also not add new buffers if the limit is hit. - - This global variable controls the amount of unsent data for - sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change - to the global variable has immediate effect. - - Default: UINT_MAX (0xFFFFFFFF) - -tcp_workaround_signed_windows - BOOLEAN - If set, assume no receipt of a window scaling option means the - remote TCP is broken and treats the window as a signed quantity. - If unset, assume the remote TCP is not broken even if we do - not receive a window scaling option from them. - Default: 0 - -tcp_thin_linear_timeouts - BOOLEAN - Enable dynamic triggering of linear timeouts for thin streams. - If set, a check is performed upon retransmission by timeout to - determine if the stream is thin (less than 4 packets in flight). - As long as the stream is found to be thin, up to 6 linear - timeouts may be performed before exponential backoff mode is - initiated. This improves retransmission latency for - non-aggressive thin streams, often found to be time-dependent. - For more information on thin streams, see - Documentation/networking/tcp-thin.txt - Default: 0 - -tcp_limit_output_bytes - INTEGER - Controls TCP Small Queue limit per tcp socket. - TCP bulk sender tends to increase packets in flight until it - gets losses notifications. With SNDBUF autotuning, this can - result in a large amount of packets queued on the local machine - (e.g.: qdiscs, CPU backlog, or device) hurting latency of other - flows, for typical pfifo_fast qdiscs. tcp_limit_output_bytes - limits the number of bytes on qdisc or device to reduce artificial - RTT/cwnd and reduce bufferbloat. - Default: 1048576 (16 * 65536) - -tcp_challenge_ack_limit - INTEGER - Limits number of Challenge ACK sent per second, as recommended - in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) - Default: 1000 - -tcp_rx_skb_cache - BOOLEAN - Controls a per TCP socket cache of one skb, that might help - performance of some workloads. This might be dangerous - on systems with a lot of TCP sockets, since it increases - memory usage. - - Default: 0 (disabled) - -UDP variables: - -udp_l3mdev_accept - BOOLEAN - Enabling this option allows a "global" bound socket to work - across L3 master domains (e.g., VRFs) with packets capable of - being received regardless of the L3 domain in which they - originated. Only valid when the kernel was compiled with - CONFIG_NET_L3_MASTER_DEV. - Default: 0 (disabled) - -udp_mem - vector of 3 INTEGERs: min, pressure, max - Number of pages allowed for queueing by all UDP sockets. - - min: Below this number of pages UDP is not bothered about its - memory appetite. When amount of memory allocated by UDP exceeds - this number, UDP starts to moderate memory usage. - - pressure: This value was introduced to follow format of tcp_mem. - - max: Number of pages allowed for queueing by all UDP sockets. - - Default is calculated at boot time from amount of available memory. - -udp_rmem_min - INTEGER - Minimal size of receive buffer used by UDP sockets in moderation. - Each UDP socket is able to use the size for receiving data, even if - total pages of UDP sockets exceed udp_mem pressure. The unit is byte. - Default: 4K - -udp_wmem_min - INTEGER - Minimal size of send buffer used by UDP sockets in moderation. - Each UDP socket is able to use the size for sending data, even if - total pages of UDP sockets exceed udp_mem pressure. The unit is byte. - Default: 4K - -RAW variables: - -raw_l3mdev_accept - BOOLEAN - Enabling this option allows a "global" bound socket to work - across L3 master domains (e.g., VRFs) with packets capable of - being received regardless of the L3 domain in which they - originated. Only valid when the kernel was compiled with - CONFIG_NET_L3_MASTER_DEV. - Default: 1 (enabled) - -CIPSOv4 Variables: - -cipso_cache_enable - BOOLEAN - If set, enable additions to and lookups from the CIPSO label mapping - cache. If unset, additions are ignored and lookups always result in a - miss. However, regardless of the setting the cache is still - invalidated when required when means you can safely toggle this on and - off and the cache will always be "safe". - Default: 1 - -cipso_cache_bucket_size - INTEGER - The CIPSO label cache consists of a fixed size hash table with each - hash bucket containing a number of cache entries. This variable limits - the number of entries in each hash bucket; the larger the value the - more CIPSO label mappings that can be cached. When the number of - entries in a given hash bucket reaches this limit adding new entries - causes the oldest entry in the bucket to be removed to make room. - Default: 10 - -cipso_rbm_optfmt - BOOLEAN - Enable the "Optimized Tag 1 Format" as defined in section 3.4.2.6 of - the CIPSO draft specification (see Documentation/netlabel for details). - This means that when set the CIPSO tag will be padded with empty - categories in order to make the packet data 32-bit aligned. - Default: 0 - -cipso_rbm_structvalid - BOOLEAN - If set, do a very strict check of the CIPSO option when - ip_options_compile() is called. If unset, relax the checks done during - ip_options_compile(). Either way is "safe" as errors are caught else - where in the CIPSO processing code but setting this to 0 (False) should - result in less work (i.e. it should be faster) but could cause problems - with other implementations that require strict checking. - Default: 0 - -IP Variables: - -ip_local_port_range - 2 INTEGERS - Defines the local port range that is used by TCP and UDP to - choose the local port. The first number is the first, the - second the last local port number. - If possible, it is better these numbers have different parity - (one even and one odd value). - Must be greater than or equal to ip_unprivileged_port_start. - The default values are 32768 and 60999 respectively. - -ip_local_reserved_ports - list of comma separated ranges - Specify the ports which are reserved for known third-party - applications. These ports will not be used by automatic port - assignments (e.g. when calling connect() or bind() with port - number 0). Explicit port allocation behavior is unchanged. - - The format used for both input and output is a comma separated - list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and - 10). Writing to the file will clear all previously reserved - ports and update the current list with the one given in the - input. - - Note that ip_local_port_range and ip_local_reserved_ports - settings are independent and both are considered by the kernel - when determining which ports are available for automatic port - assignments. - - You can reserve ports which are not in the current - ip_local_port_range, e.g.: - - $ cat /proc/sys/net/ipv4/ip_local_port_range - 32000 60999 - $ cat /proc/sys/net/ipv4/ip_local_reserved_ports - 8080,9148 - - although this is redundant. However such a setting is useful - if later the port range is changed to a value that will - include the reserved ports. - - Default: Empty - -ip_unprivileged_port_start - INTEGER - This is a per-namespace sysctl. It defines the first - unprivileged port in the network namespace. Privileged ports - require root or CAP_NET_BIND_SERVICE in order to bind to them. - To disable all privileged ports, set this to 0. They must not - overlap with the ip_local_port_range. - - Default: 1024 - -ip_nonlocal_bind - BOOLEAN - If set, allows processes to bind() to non-local IP addresses, - which can be quite useful - but may break some applications. - Default: 0 - -ip_autobind_reuse - BOOLEAN - By default, bind() does not select the ports automatically even if - the new socket and all sockets bound to the port have SO_REUSEADDR. - ip_autobind_reuse allows bind() to reuse the port and this is useful - when you use bind()+connect(), but may break some applications. - The preferred solution is to use IP_BIND_ADDRESS_NO_PORT and this - option should only be set by experts. - Default: 0 - -ip_dynaddr - BOOLEAN - If set non-zero, enables support for dynamic addresses. - If set to a non-zero value larger than 1, a kernel log - message will be printed when dynamic address rewriting - occurs. - Default: 0 - -ip_early_demux - BOOLEAN - Optimize input packet processing down to one demux for - certain kinds of local sockets. Currently we only do this - for established TCP and connected UDP sockets. - - It may add an additional cost for pure routing workloads that - reduces overall throughput, in such case you should disable it. - Default: 1 - -ping_group_range - 2 INTEGERS - Restrict ICMP_PROTO datagram sockets to users in the group range. - The default is "1 0", meaning, that nobody (not even root) may - create ping sockets. Setting it to "100 100" would grant permissions - to the single group. "0 4294967295" would enable it for the world, "100 - 4294967295" would enable it for the users, but not daemons. - -tcp_early_demux - BOOLEAN - Enable early demux for established TCP sockets. - Default: 1 - -udp_early_demux - BOOLEAN - Enable early demux for connected UDP sockets. Disable this if - your system could experience more unconnected load. - Default: 1 - -icmp_echo_ignore_all - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO - requests sent to it. - Default: 0 - -icmp_echo_ignore_broadcasts - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO and - TIMESTAMP requests sent to it via broadcast/multicast. - Default: 1 - -icmp_ratelimit - INTEGER - Limit the maximal rates for sending ICMP packets whose type matches - icmp_ratemask (see below) to specific targets. - 0 to disable any limiting, - otherwise the minimal space between responses in milliseconds. - Note that another sysctl, icmp_msgs_per_sec limits the number - of ICMP packets sent on all targets. - Default: 1000 - -icmp_msgs_per_sec - INTEGER - Limit maximal number of ICMP packets sent per second from this host. - Only messages whose type matches icmp_ratemask (see below) are - controlled by this limit. - Default: 1000 - -icmp_msgs_burst - INTEGER - icmp_msgs_per_sec controls number of ICMP packets sent per second, - while icmp_msgs_burst controls the burst size of these packets. - Default: 50 - -icmp_ratemask - INTEGER - Mask made of ICMP types for which rates are being limited. - Significant bits: IHGFEDCBA9876543210 - Default mask: 0000001100000011000 (6168) - - Bit definitions (see include/linux/icmp.h): - 0 Echo Reply - 3 Destination Unreachable * - 4 Source Quench * - 5 Redirect - 8 Echo Request - B Time Exceeded * - C Parameter Problem * - D Timestamp Request - E Timestamp Reply - F Info Request - G Info Reply - H Address Mask Request - I Address Mask Reply - - * These are rate limited by default (see default mask above) - -icmp_ignore_bogus_error_responses - BOOLEAN - Some routers violate RFC1122 by sending bogus responses to broadcast - frames. Such violations are normally logged via a kernel warning. - If this is set to TRUE, the kernel will not give such warnings, which - will avoid log file clutter. - Default: 1 - -icmp_errors_use_inbound_ifaddr - BOOLEAN - - If zero, icmp error messages are sent with the primary address of - the exiting interface. - - If non-zero, the message will be sent with the primary address of - the interface that received the packet that caused the icmp error. - This is the behaviour network many administrators will expect from - a router. And it can make debugging complicated network layouts - much easier. - - Note that if no primary address exists for the interface selected, - then the primary address of the first non-loopback interface that - has one will be used regardless of this setting. - - Default: 0 - -igmp_max_memberships - INTEGER - Change the maximum number of multicast groups we can subscribe to. - Default: 20 - - Theoretical maximum value is bounded by having to send a membership - report in a single datagram (i.e. the report can't span multiple - datagrams, or risk confusing the switch and leaving groups you don't - intend to). - - The number of supported groups 'M' is bounded by the number of group - report entries you can fit into a single datagram of 65535 bytes. - - M = 65536-sizeof (ip header)/(sizeof(Group record)) - - Group records are variable length, with a minimum of 12 bytes. - So net.ipv4.igmp_max_memberships should not be set higher than: - - (65536-24) / 12 = 5459 - - The value 5459 assumes no IP header options, so in practice - this number may be lower. - -igmp_max_msf - INTEGER - Maximum number of addresses allowed in the source filter list for a - multicast group. - Default: 10 - -igmp_qrv - INTEGER - Controls the IGMP query robustness variable (see RFC2236 8.1). - Default: 2 (as specified by RFC2236 8.1) - Minimum: 1 (as specified by RFC6636 4.5) - -force_igmp_version - INTEGER - 0 - (default) No enforcement of a IGMP version, IGMPv1/v2 fallback - allowed. Will back to IGMPv3 mode again if all IGMPv1/v2 Querier - Present timer expires. - 1 - Enforce to use IGMP version 1. Will also reply IGMPv1 report if - receive IGMPv2/v3 query. - 2 - Enforce to use IGMP version 2. Will fallback to IGMPv1 if receive - IGMPv1 query message. Will reply report if receive IGMPv3 query. - 3 - Enforce to use IGMP version 3. The same react with default 0. - - Note: this is not the same with force_mld_version because IGMPv3 RFC3376 - Security Considerations does not have clear description that we could - ignore other version messages completely as MLDv2 RFC3810. So make - this value as default 0 is recommended. - -conf/interface/* changes special settings per interface (where -"interface" is the name of your network interface) - -conf/all/* is special, changes the settings for all interfaces - -log_martians - BOOLEAN - Log packets with impossible addresses to kernel log. - log_martians for the interface will be enabled if at least one of - conf/{all,interface}/log_martians is set to TRUE, - it will be disabled otherwise - -accept_redirects - BOOLEAN - Accept ICMP redirect messages. - accept_redirects for the interface will be enabled if: - - both conf/{all,interface}/accept_redirects are TRUE in the case - forwarding for the interface is enabled - or - - at least one of conf/{all,interface}/accept_redirects is TRUE in the - case forwarding for the interface is disabled - accept_redirects for the interface will be disabled otherwise - default TRUE (host) - FALSE (router) - -forwarding - BOOLEAN - Enable IP forwarding on this interface. This controls whether packets - received _on_ this interface can be forwarded. - -mc_forwarding - BOOLEAN - Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE - and a multicast routing daemon is required. - conf/all/mc_forwarding must also be set to TRUE to enable multicast - routing for the interface - -medium_id - INTEGER - Integer value used to differentiate the devices by the medium they - are attached to. Two devices can have different id values when - the broadcast packets are received only on one of them. - The default value 0 means that the device is the only interface - to its medium, value of -1 means that medium is not known. - - Currently, it is used to change the proxy_arp behavior: - the proxy_arp feature is enabled for packets forwarded between - two devices attached to different media. - -proxy_arp - BOOLEAN - Do proxy arp. - proxy_arp for the interface will be enabled if at least one of - conf/{all,interface}/proxy_arp is set to TRUE, - it will be disabled otherwise - -proxy_arp_pvlan - BOOLEAN - Private VLAN proxy arp. - Basically allow proxy arp replies back to the same interface - (from which the ARP request/solicitation was received). - - This is done to support (ethernet) switch features, like RFC - 3069, where the individual ports are NOT allowed to - communicate with each other, but they are allowed to talk to - the upstream router. As described in RFC 3069, it is possible - to allow these hosts to communicate through the upstream - router by proxy_arp'ing. Don't need to be used together with - proxy_arp. - - This technology is known by different names: - In RFC 3069 it is called VLAN Aggregation. - Cisco and Allied Telesyn call it Private VLAN. - Hewlett-Packard call it Source-Port filtering or port-isolation. - Ericsson call it MAC-Forced Forwarding (RFC Draft). - -shared_media - BOOLEAN - Send(router) or accept(host) RFC1620 shared media redirects. - Overrides secure_redirects. - shared_media for the interface will be enabled if at least one of - conf/{all,interface}/shared_media is set to TRUE, - it will be disabled otherwise - default TRUE - -secure_redirects - BOOLEAN - Accept ICMP redirect messages only to gateways listed in the - interface's current gateway list. Even if disabled, RFC1122 redirect - rules still apply. - Overridden by shared_media. - secure_redirects for the interface will be enabled if at least one of - conf/{all,interface}/secure_redirects is set to TRUE, - it will be disabled otherwise - default TRUE - -send_redirects - BOOLEAN - Send redirects, if router. - send_redirects for the interface will be enabled if at least one of - conf/{all,interface}/send_redirects is set to TRUE, - it will be disabled otherwise - Default: TRUE - -bootp_relay - BOOLEAN - Accept packets with source address 0.b.c.d destined - not to this host as local ones. It is supposed, that - BOOTP relay daemon will catch and forward such packets. - conf/all/bootp_relay must also be set to TRUE to enable BOOTP relay - for the interface - default FALSE - Not Implemented Yet. - -accept_source_route - BOOLEAN - Accept packets with SRR option. - conf/all/accept_source_route must also be set to TRUE to accept packets - with SRR option on the interface - default TRUE (router) - FALSE (host) - -accept_local - BOOLEAN - Accept packets with local source addresses. In combination with - suitable routing, this can be used to direct packets between two - local interfaces over the wire and have them accepted properly. - default FALSE - -route_localnet - BOOLEAN - Do not consider loopback addresses as martian source or destination - while routing. This enables the use of 127/8 for local routing purposes. - default FALSE - -rp_filter - INTEGER - 0 - No source validation. - 1 - Strict mode as defined in RFC3704 Strict Reverse Path - Each incoming packet is tested against the FIB and if the interface - is not the best reverse path the packet check will fail. - By default failed packets are discarded. - 2 - Loose mode as defined in RFC3704 Loose Reverse Path - Each incoming packet's source address is also tested against the FIB - and if the source address is not reachable via any interface - the packet check will fail. - - Current recommended practice in RFC3704 is to enable strict mode - to prevent IP spoofing from DDos attacks. If using asymmetric routing - or other complicated routing, then loose mode is recommended. - - The max value from conf/{all,interface}/rp_filter is used - when doing source validation on the {interface}. - - Default value is 0. Note that some distributions enable it - in startup scripts. - -arp_filter - BOOLEAN - 1 - Allows you to have multiple network interfaces on the same - subnet, and have the ARPs for each interface be answered - based on whether or not the kernel would route a packet from - the ARP'd IP out that interface (therefore you must use source - based routing for this to work). In other words it allows control - of which cards (usually 1) will respond to an arp request. - - 0 - (default) The kernel can respond to arp requests with addresses - from other interfaces. This may seem wrong but it usually makes - sense, because it increases the chance of successful communication. - IP addresses are owned by the complete host on Linux, not by - particular interfaces. Only for more complex setups like load- - balancing, does this behaviour cause problems. - - arp_filter for the interface will be enabled if at least one of - conf/{all,interface}/arp_filter is set to TRUE, - it will be disabled otherwise - -arp_announce - INTEGER - Define different restriction levels for announcing the local - source IP address from IP packets in ARP requests sent on - interface: - 0 - (default) Use any local address, configured on any interface - 1 - Try to avoid local addresses that are not in the target's - subnet for this interface. This mode is useful when target - hosts reachable via this interface require the source IP - address in ARP requests to be part of their logical network - configured on the receiving interface. When we generate the - request we will check all our subnets that include the - target IP and will preserve the source address if it is from - such subnet. If there is no such subnet we select source - address according to the rules for level 2. - 2 - Always use the best local address for this target. - In this mode we ignore the source address in the IP packet - and try to select local address that we prefer for talks with - the target host. Such local address is selected by looking - for primary IP addresses on all our subnets on the outgoing - interface that include the target IP address. If no suitable - local address is found we select the first local address - we have on the outgoing interface or on all other interfaces, - with the hope we will receive reply for our request and - even sometimes no matter the source IP address we announce. - - The max value from conf/{all,interface}/arp_announce is used. - - Increasing the restriction level gives more chance for - receiving answer from the resolved target while decreasing - the level announces more valid sender's information. - -arp_ignore - INTEGER - Define different modes for sending replies in response to - received ARP requests that resolve local target IP addresses: - 0 - (default): reply for any local target IP address, configured - on any interface - 1 - reply only if the target IP address is local address - configured on the incoming interface - 2 - reply only if the target IP address is local address - configured on the incoming interface and both with the - sender's IP address are part from same subnet on this interface - 3 - do not reply for local addresses configured with scope host, - only resolutions for global and link addresses are replied - 4-7 - reserved - 8 - do not reply for all local addresses - - The max value from conf/{all,interface}/arp_ignore is used - when ARP request is received on the {interface} - -arp_notify - BOOLEAN - Define mode for notification of address and device changes. - 0 - (default): do nothing - 1 - Generate gratuitous arp requests when device is brought up - or hardware address changes. - -arp_accept - BOOLEAN - Define behavior for gratuitous ARP frames who's IP is not - already present in the ARP table: - 0 - don't create new entries in the ARP table - 1 - create new entries in the ARP table - - Both replies and requests type gratuitous arp will trigger the - ARP table to be updated, if this setting is on. - - If the ARP table already contains the IP address of the - gratuitous arp frame, the arp table will be updated regardless - if this setting is on or off. - -mcast_solicit - INTEGER - The maximum number of multicast probes in INCOMPLETE state, - when the associated hardware address is unknown. Defaults - to 3. - -ucast_solicit - INTEGER - The maximum number of unicast probes in PROBE state, when - the hardware address is being reconfirmed. Defaults to 3. - -app_solicit - INTEGER - The maximum number of probes to send to the user space ARP daemon - via netlink before dropping back to multicast probes (see - mcast_resolicit). Defaults to 0. - -mcast_resolicit - INTEGER - The maximum number of multicast probes after unicast and - app probes in PROBE state. Defaults to 0. - -disable_policy - BOOLEAN - Disable IPSEC policy (SPD) for this interface - -disable_xfrm - BOOLEAN - Disable IPSEC encryption on this interface, whatever the policy - -igmpv2_unsolicited_report_interval - INTEGER - The interval in milliseconds in which the next unsolicited - IGMPv1 or IGMPv2 report retransmit will take place. - Default: 10000 (10 seconds) - -igmpv3_unsolicited_report_interval - INTEGER - The interval in milliseconds in which the next unsolicited - IGMPv3 report retransmit will take place. - Default: 1000 (1 seconds) - -promote_secondaries - BOOLEAN - When a primary IP address is removed from this interface - promote a corresponding secondary IP address instead of - removing all the corresponding secondary IP addresses. - -drop_unicast_in_l2_multicast - BOOLEAN - Drop any unicast IP packets that are received in link-layer - multicast (or broadcast) frames. - This behavior (for multicast) is actually a SHOULD in RFC - 1122, but is disabled by default for compatibility reasons. - Default: off (0) - -drop_gratuitous_arp - BOOLEAN - Drop all gratuitous ARP frames, for example if there's a known - good ARP proxy on the network and such frames need not be used - (or in the case of 802.11, must not be used to prevent attacks.) - Default: off (0) - - -tag - INTEGER - Allows you to write a number, which can be used as required. - Default value is 0. - -xfrm4_gc_thresh - INTEGER - (Obsolete since linux-4.14) - The threshold at which we will start garbage collecting for IPv4 - destination cache entries. At twice this value the system will - refuse new allocations. - -igmp_link_local_mcast_reports - BOOLEAN - Enable IGMP reports for link local multicast groups in the - 224.0.0.X range. - Default TRUE - -Alexey Kuznetsov. -kuznet@ms2.inr.ac.ru - -Updated by: -Andi Kleen -ak@muc.de -Nicolas Delon -delon.nicolas@wanadoo.fr - - - - -/proc/sys/net/ipv6/* Variables: - -IPv6 has no global variables such as tcp_*. tcp_* settings under ipv4/ also -apply to IPv6 [XXX?]. - -bindv6only - BOOLEAN - Default value for IPV6_V6ONLY socket option, - which restricts use of the IPv6 socket to IPv6 communication - only. - TRUE: disable IPv4-mapped address feature - FALSE: enable IPv4-mapped address feature - - Default: FALSE (as specified in RFC3493) - -flowlabel_consistency - BOOLEAN - Protect the consistency (and unicity) of flow label. - You have to disable it to use IPV6_FL_F_REFLECT flag on the - flow label manager. - TRUE: enabled - FALSE: disabled - Default: TRUE - -auto_flowlabels - INTEGER - Automatically generate flow labels based on a flow hash of the - packet. This allows intermediate devices, such as routers, to - identify packet flows for mechanisms like Equal Cost Multipath - Routing (see RFC 6438). - 0: automatic flow labels are completely disabled - 1: automatic flow labels are enabled by default, they can be - disabled on a per socket basis using the IPV6_AUTOFLOWLABEL - socket option - 2: automatic flow labels are allowed, they may be enabled on a - per socket basis using the IPV6_AUTOFLOWLABEL socket option - 3: automatic flow labels are enabled and enforced, they cannot - be disabled by the socket option - Default: 1 - -flowlabel_state_ranges - BOOLEAN - Split the flow label number space into two ranges. 0-0x7FFFF is - reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF - is reserved for stateless flow labels as described in RFC6437. - TRUE: enabled - FALSE: disabled - Default: true - -flowlabel_reflect - INTEGER - Control flow label reflection. Needed for Path MTU - Discovery to work with Equal Cost Multipath Routing in anycast - environments. See RFC 7690 and: - https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 - - This is a bitmask. - 1: enabled for established flows - - Note that this prevents automatic flowlabel changes, as done - in "tcp: change IPv6 flow-label upon receiving spurious retransmission" - and "tcp: Change txhash on every SYN and RTO retransmit" - - 2: enabled for TCP RESET packets (no active listener) - If set, a RST packet sent in response to a SYN packet on a closed - port will reflect the incoming flow label. - - 4: enabled for ICMPv6 echo reply messages. - - Default: 0 - -fib_multipath_hash_policy - INTEGER - Controls which hash policy to use for multipath routes. - Default: 0 (Layer 3) - Possible values: - 0 - Layer 3 (source and destination addresses plus flow label) - 1 - Layer 4 (standard 5-tuple) - 2 - Layer 3 or inner Layer 3 if present - -anycast_src_echo_reply - BOOLEAN - Controls the use of anycast addresses as source addresses for ICMPv6 - echo reply - TRUE: enabled - FALSE: disabled - Default: FALSE - -idgen_delay - INTEGER - Controls the delay in seconds after which time to retry - privacy stable address generation if a DAD conflict is - detected. - Default: 1 (as specified in RFC7217) - -idgen_retries - INTEGER - Controls the number of retries to generate a stable privacy - address if a DAD conflict is detected. - Default: 3 (as specified in RFC7217) - -mld_qrv - INTEGER - Controls the MLD query robustness variable (see RFC3810 9.1). - Default: 2 (as specified by RFC3810 9.1) - Minimum: 1 (as specified by RFC6636 4.5) - -max_dst_opts_number - INTEGER - Maximum number of non-padding TLVs allowed in a Destination - options extension header. If this value is less than zero - then unknown options are disallowed and the number of known - TLVs allowed is the absolute value of this number. - Default: 8 - -max_hbh_opts_number - INTEGER - Maximum number of non-padding TLVs allowed in a Hop-by-Hop - options extension header. If this value is less than zero - then unknown options are disallowed and the number of known - TLVs allowed is the absolute value of this number. - Default: 8 - -max_dst_opts_length - INTEGER - Maximum length allowed for a Destination options extension - header. - Default: INT_MAX (unlimited) - -max_hbh_length - INTEGER - Maximum length allowed for a Hop-by-Hop options extension - header. - Default: INT_MAX (unlimited) - -skip_notify_on_dev_down - BOOLEAN - Controls whether an RTM_DELROUTE message is generated for routes - removed when a device is taken down or deleted. IPv4 does not - generate this message; IPv6 does by default. Setting this sysctl - to true skips the message, making IPv4 and IPv6 on par in relying - on userspace caches to track link events and evict routes. - Default: false (generate message) - -nexthop_compat_mode - BOOLEAN - New nexthop API provides a means for managing nexthops independent of - prefixes. Backwards compatibilty with old route format is enabled by - default which means route dumps and notifications contain the new - nexthop attribute but also the full, expanded nexthop definition. - Further, updates or deletes of a nexthop configuration generate route - notifications for each fib entry using the nexthop. Once a system - understands the new API, this sysctl can be disabled to achieve full - performance benefits of the new API by disabling the nexthop expansion - and extraneous notifications. - Default: true (backward compat mode) - -IPv6 Fragmentation: - -ip6frag_high_thresh - INTEGER - Maximum memory used to reassemble IPv6 fragments. When - ip6frag_high_thresh bytes of memory is allocated for this purpose, - the fragment handler will toss packets until ip6frag_low_thresh - is reached. - -ip6frag_low_thresh - INTEGER - See ip6frag_high_thresh - -ip6frag_time - INTEGER - Time in seconds to keep an IPv6 fragment in memory. - -IPv6 Segment Routing: - -seg6_flowlabel - INTEGER - Controls the behaviour of computing the flowlabel of outer - IPv6 header in case of SR T.encaps - - -1 set flowlabel to zero. - 0 copy flowlabel from Inner packet in case of Inner IPv6 - (Set flowlabel to 0 in case IPv4/L2) - 1 Compute the flowlabel using seg6_make_flowlabel() - - Default is 0. - -conf/default/*: - Change the interface-specific default settings. - - -conf/all/*: - Change all the interface-specific settings. - - [XXX: Other special features than forwarding?] - -conf/all/forwarding - BOOLEAN - Enable global IPv6 forwarding between all interfaces. - - IPv4 and IPv6 work differently here; e.g. netfilter must be used - to control which interfaces may forward packets and which not. - - This also sets all interfaces' Host/Router setting - 'forwarding' to the specified value. See below for details. - - This referred to as global forwarding. - -proxy_ndp - BOOLEAN - Do proxy ndp. - -fwmark_reflect - BOOLEAN - Controls the fwmark of kernel-generated IPv6 reply packets that are not - associated with a socket for example, TCP RSTs or ICMPv6 echo replies). - If unset, these packets have a fwmark of zero. If set, they have the - fwmark of the packet they are replying to. - Default: 0 - -conf/interface/*: - Change special settings per interface. - - The functional behaviour for certain settings is different - depending on whether local forwarding is enabled or not. - -accept_ra - INTEGER - Accept Router Advertisements; autoconfigure using them. - - It also determines whether or not to transmit Router - Solicitations. If and only if the functional setting is to - accept Router Advertisements, Router Solicitations will be - transmitted. - - Possible values are: - 0 Do not accept Router Advertisements. - 1 Accept Router Advertisements if forwarding is disabled. - 2 Overrule forwarding behaviour. Accept Router Advertisements - even if forwarding is enabled. - - Functional default: enabled if local forwarding is disabled. - disabled if local forwarding is enabled. - -accept_ra_defrtr - BOOLEAN - Learn default router in Router Advertisement. - - Functional default: enabled if accept_ra is enabled. - disabled if accept_ra is disabled. - -accept_ra_from_local - BOOLEAN - Accept RA with source-address that is found on local machine - if the RA is otherwise proper and able to be accepted. - Default is to NOT accept these as it may be an un-intended - network loop. - - Functional default: - enabled if accept_ra_from_local is enabled - on a specific interface. - disabled if accept_ra_from_local is disabled - on a specific interface. - -accept_ra_min_hop_limit - INTEGER - Minimum hop limit Information in Router Advertisement. - - Hop limit Information in Router Advertisement less than this - variable shall be ignored. - - Default: 1 - -accept_ra_pinfo - BOOLEAN - Learn Prefix Information in Router Advertisement. - - Functional default: enabled if accept_ra is enabled. - disabled if accept_ra is disabled. - -accept_ra_rt_info_min_plen - INTEGER - Minimum prefix length of Route Information in RA. - - Route Information w/ prefix smaller than this variable shall - be ignored. - - Functional default: 0 if accept_ra_rtr_pref is enabled. - -1 if accept_ra_rtr_pref is disabled. - -accept_ra_rt_info_max_plen - INTEGER - Maximum prefix length of Route Information in RA. - - Route Information w/ prefix larger than this variable shall - be ignored. - - Functional default: 0 if accept_ra_rtr_pref is enabled. - -1 if accept_ra_rtr_pref is disabled. - -accept_ra_rtr_pref - BOOLEAN - Accept Router Preference in RA. - - Functional default: enabled if accept_ra is enabled. - disabled if accept_ra is disabled. - -accept_ra_mtu - BOOLEAN - Apply the MTU value specified in RA option 5 (RFC4861). If - disabled, the MTU specified in the RA will be ignored. - - Functional default: enabled if accept_ra is enabled. - disabled if accept_ra is disabled. - -accept_redirects - BOOLEAN - Accept Redirects. - - Functional default: enabled if local forwarding is disabled. - disabled if local forwarding is enabled. - -accept_source_route - INTEGER - Accept source routing (routing extension header). - - >= 0: Accept only routing header type 2. - < 0: Do not accept routing header. - - Default: 0 - -autoconf - BOOLEAN - Autoconfigure addresses using Prefix Information in Router - Advertisements. - - Functional default: enabled if accept_ra_pinfo is enabled. - disabled if accept_ra_pinfo is disabled. - -dad_transmits - INTEGER - The amount of Duplicate Address Detection probes to send. - Default: 1 - -forwarding - INTEGER - Configure interface-specific Host/Router behaviour. - - Note: It is recommended to have the same setting on all - interfaces; mixed router/host scenarios are rather uncommon. - - Possible values are: - 0 Forwarding disabled - 1 Forwarding enabled - - FALSE (0): - - By default, Host behaviour is assumed. This means: - - 1. IsRouter flag is not set in Neighbour Advertisements. - 2. If accept_ra is TRUE (default), transmit Router - Solicitations. - 3. If accept_ra is TRUE (default), accept Router - Advertisements (and do autoconfiguration). - 4. If accept_redirects is TRUE (default), accept Redirects. - - TRUE (1): - - If local forwarding is enabled, Router behaviour is assumed. - This means exactly the reverse from the above: - - 1. IsRouter flag is set in Neighbour Advertisements. - 2. Router Solicitations are not sent unless accept_ra is 2. - 3. Router Advertisements are ignored unless accept_ra is 2. - 4. Redirects are ignored. - - Default: 0 (disabled) if global forwarding is disabled (default), - otherwise 1 (enabled). - -hop_limit - INTEGER - Default Hop Limit to set. - Default: 64 - -mtu - INTEGER - Default Maximum Transfer Unit - Default: 1280 (IPv6 required minimum) - -ip_nonlocal_bind - BOOLEAN - If set, allows processes to bind() to non-local IPv6 addresses, - which can be quite useful - but may break some applications. - Default: 0 - -router_probe_interval - INTEGER - Minimum interval (in seconds) between Router Probing described - in RFC4191. - - Default: 60 - -router_solicitation_delay - INTEGER - Number of seconds to wait after interface is brought up - before sending Router Solicitations. - Default: 1 - -router_solicitation_interval - INTEGER - Number of seconds to wait between Router Solicitations. - Default: 4 - -router_solicitations - INTEGER - Number of Router Solicitations to send until assuming no - routers are present. - Default: 3 - -use_oif_addrs_only - BOOLEAN - When enabled, the candidate source addresses for destinations - routed via this interface are restricted to the set of addresses - configured on this interface (vis. RFC 6724, section 4). - - Default: false - -use_tempaddr - INTEGER - Preference for Privacy Extensions (RFC3041). - <= 0 : disable Privacy Extensions - == 1 : enable Privacy Extensions, but prefer public - addresses over temporary addresses. - > 1 : enable Privacy Extensions and prefer temporary - addresses over public addresses. - Default: 0 (for most devices) - -1 (for point-to-point devices and loopback devices) - -temp_valid_lft - INTEGER - valid lifetime (in seconds) for temporary addresses. - Default: 604800 (7 days) - -temp_prefered_lft - INTEGER - Preferred lifetime (in seconds) for temporary addresses. - Default: 86400 (1 day) - -keep_addr_on_down - INTEGER - Keep all IPv6 addresses on an interface down event. If set static - global addresses with no expiration time are not flushed. - >0 : enabled - 0 : system default - <0 : disabled - - Default: 0 (addresses are removed) - -max_desync_factor - INTEGER - Maximum value for DESYNC_FACTOR, which is a random value - that ensures that clients don't synchronize with each - other and generate new addresses at exactly the same time. - value is in seconds. - Default: 600 - -regen_max_retry - INTEGER - Number of attempts before give up attempting to generate - valid temporary addresses. - Default: 5 - -max_addresses - INTEGER - Maximum number of autoconfigured addresses per interface. Setting - to zero disables the limitation. It is not recommended to set this - value too large (or to zero) because it would be an easy way to - crash the kernel by allowing too many addresses to be created. - Default: 16 - -disable_ipv6 - BOOLEAN - Disable IPv6 operation. If accept_dad is set to 2, this value - will be dynamically set to TRUE if DAD fails for the link-local - address. - Default: FALSE (enable IPv6 operation) - - When this value is changed from 1 to 0 (IPv6 is being enabled), - it will dynamically create a link-local address on the given - interface and start Duplicate Address Detection, if necessary. - - When this value is changed from 0 to 1 (IPv6 is being disabled), - it will dynamically delete all addresses and routes on the given - interface. From now on it will not possible to add addresses/routes - to the selected interface. - -accept_dad - INTEGER - Whether to accept DAD (Duplicate Address Detection). - 0: Disable DAD - 1: Enable DAD (default) - 2: Enable DAD, and disable IPv6 operation if MAC-based duplicate - link-local address has been found. - - DAD operation and mode on a given interface will be selected according - to the maximum value of conf/{all,interface}/accept_dad. - -force_tllao - BOOLEAN - Enable sending the target link-layer address option even when - responding to a unicast neighbor solicitation. - Default: FALSE - - Quoting from RFC 2461, section 4.4, Target link-layer address: - - "The option MUST be included for multicast solicitations in order to - avoid infinite Neighbor Solicitation "recursion" when the peer node - does not have a cache entry to return a Neighbor Advertisements - message. When responding to unicast solicitations, the option can be - omitted since the sender of the solicitation has the correct link- - layer address; otherwise it would not have be able to send the unicast - solicitation in the first place. However, including the link-layer - address in this case adds little overhead and eliminates a potential - race condition where the sender deletes the cached link-layer address - prior to receiving a response to a previous solicitation." - -ndisc_notify - BOOLEAN - Define mode for notification of address and device changes. - 0 - (default): do nothing - 1 - Generate unsolicited neighbour advertisements when device is brought - up or hardware address changes. - -ndisc_tclass - INTEGER - The IPv6 Traffic Class to use by default when sending IPv6 Neighbor - Discovery (Router Solicitation, Router Advertisement, Neighbor - Solicitation, Neighbor Advertisement, Redirect) messages. - These 8 bits can be interpreted as 6 high order bits holding the DSCP - value and 2 low order bits representing ECN (which you probably want - to leave cleared). - 0 - (default) - -mldv1_unsolicited_report_interval - INTEGER - The interval in milliseconds in which the next unsolicited - MLDv1 report retransmit will take place. - Default: 10000 (10 seconds) - -mldv2_unsolicited_report_interval - INTEGER - The interval in milliseconds in which the next unsolicited - MLDv2 report retransmit will take place. - Default: 1000 (1 second) - -force_mld_version - INTEGER - 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed - 1 - Enforce to use MLD version 1 - 2 - Enforce to use MLD version 2 - -suppress_frag_ndisc - INTEGER - Control RFC 6980 (Security Implications of IPv6 Fragmentation - with IPv6 Neighbor Discovery) behavior: - 1 - (default) discard fragmented neighbor discovery packets - 0 - allow fragmented neighbor discovery packets - -optimistic_dad - BOOLEAN - Whether to perform Optimistic Duplicate Address Detection (RFC 4429). - 0: disabled (default) - 1: enabled - - Optimistic Duplicate Address Detection for the interface will be enabled - if at least one of conf/{all,interface}/optimistic_dad is set to 1, - it will be disabled otherwise. - -use_optimistic - BOOLEAN - If enabled, do not classify optimistic addresses as deprecated during - source address selection. Preferred addresses will still be chosen - before optimistic addresses, subject to other ranking in the source - address selection algorithm. - 0: disabled (default) - 1: enabled - - This will be enabled if at least one of - conf/{all,interface}/use_optimistic is set to 1, disabled otherwise. - -stable_secret - IPv6 address - This IPv6 address will be used as a secret to generate IPv6 - addresses for link-local addresses and autoconfigured - ones. All addresses generated after setting this secret will - be stable privacy ones by default. This can be changed via the - addrgenmode ip-link. conf/default/stable_secret is used as the - secret for the namespace, the interface specific ones can - overwrite that. Writes to conf/all/stable_secret are refused. - - It is recommended to generate this secret during installation - of a system and keep it stable after that. - - By default the stable secret is unset. - -addr_gen_mode - INTEGER - Defines how link-local and autoconf addresses are generated. - - 0: generate address based on EUI64 (default) - 1: do no generate a link-local address, use EUI64 for addresses generated - from autoconf - 2: generate stable privacy addresses, using the secret from - stable_secret (RFC7217) - 3: generate stable privacy addresses, using a random secret if unset - -drop_unicast_in_l2_multicast - BOOLEAN - Drop any unicast IPv6 packets that are received in link-layer - multicast (or broadcast) frames. - - By default this is turned off. - -drop_unsolicited_na - BOOLEAN - Drop all unsolicited neighbor advertisements, for example if there's - a known good NA proxy on the network and such frames need not be used - (or in the case of 802.11, must not be used to prevent attacks.) - - By default this is turned off. - -enhanced_dad - BOOLEAN - Include a nonce option in the IPv6 neighbor solicitation messages used for - duplicate address detection per RFC7527. A received DAD NS will only signal - a duplicate address if the nonce is different. This avoids any false - detection of duplicates due to loopback of the NS messages that we send. - The nonce option will be sent on an interface unless both of - conf/{all,interface}/enhanced_dad are set to FALSE. - Default: TRUE - -icmp/*: -ratelimit - INTEGER - Limit the maximal rates for sending ICMPv6 messages. - 0 to disable any limiting, - otherwise the minimal space between responses in milliseconds. - Default: 1000 - -ratemask - list of comma separated ranges - For ICMPv6 message types matching the ranges in the ratemask, limit - the sending of the message according to ratelimit parameter. - - The format used for both input and output is a comma separated - list of ranges (e.g. "0-127,129" for ICMPv6 message type 0 to 127 and - 129). Writing to the file will clear all previous ranges of ICMPv6 - message types and update the current list with the input. - - Refer to: https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xhtml - for numerical values of ICMPv6 message types, e.g. echo request is 128 - and echo reply is 129. - - Default: 0-1,3-127 (rate limit ICMPv6 errors except Packet Too Big) - -echo_ignore_all - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO - requests sent to it over the IPv6 protocol. - Default: 0 - -echo_ignore_multicast - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO - requests sent to it over the IPv6 protocol via multicast. - Default: 0 - -echo_ignore_anycast - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO - requests sent to it over the IPv6 protocol destined to anycast address. - Default: 0 - -xfrm6_gc_thresh - INTEGER - (Obsolete since linux-4.14) - The threshold at which we will start garbage collecting for IPv6 - destination cache entries. At twice this value the system will - refuse new allocations. - - -IPv6 Update by: -Pekka Savola -YOSHIFUJI Hideaki / USAGI Project - - -/proc/sys/net/bridge/* Variables: - -bridge-nf-call-arptables - BOOLEAN - 1 : pass bridged ARP traffic to arptables' FORWARD chain. - 0 : disable this. - Default: 1 - -bridge-nf-call-iptables - BOOLEAN - 1 : pass bridged IPv4 traffic to iptables' chains. - 0 : disable this. - Default: 1 - -bridge-nf-call-ip6tables - BOOLEAN - 1 : pass bridged IPv6 traffic to ip6tables' chains. - 0 : disable this. - Default: 1 - -bridge-nf-filter-vlan-tagged - BOOLEAN - 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables. - 0 : disable this. - Default: 0 - -bridge-nf-filter-pppoe-tagged - BOOLEAN - 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables. - 0 : disable this. - Default: 0 - -bridge-nf-pass-vlan-input-dev - BOOLEAN - 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan - interface on the bridge and set the netfilter input device to the vlan. - This allows use of e.g. "iptables -i br0.1" and makes the REDIRECT - target work with vlan-on-top-of-bridge interfaces. When no matching - vlan interface is found, or this switch is off, the input device is - set to the bridge interface. - 0: disable bridge netfilter vlan interface lookup. - Default: 0 - -proc/sys/net/sctp/* Variables: - -addip_enable - BOOLEAN - Enable or disable extension of Dynamic Address Reconfiguration - (ADD-IP) functionality specified in RFC5061. This extension provides - the ability to dynamically add and remove new addresses for the SCTP - associations. - - 1: Enable extension. - - 0: Disable extension. - - Default: 0 - -pf_enable - INTEGER - Enable or disable pf (pf is short for potentially failed) state. A value - of pf_retrans > path_max_retrans also disables pf state. That is, one of - both pf_enable and pf_retrans > path_max_retrans can disable pf state. - Since pf_retrans and path_max_retrans can be changed by userspace - application, sometimes user expects to disable pf state by the value of - pf_retrans > path_max_retrans, but occasionally the value of pf_retrans - or path_max_retrans is changed by the user application, this pf state is - enabled. As such, it is necessary to add this to dynamically enable - and disable pf state. See: - https://datatracker.ietf.org/doc/draft-ietf-tsvwg-sctp-failover for - details. - - 1: Enable pf. - - 0: Disable pf. - - Default: 1 - -pf_expose - INTEGER - Unset or enable/disable pf (pf is short for potentially failed) state - exposure. Applications can control the exposure of the PF path state - in the SCTP_PEER_ADDR_CHANGE event and the SCTP_GET_PEER_ADDR_INFO - sockopt. When it's unset, no SCTP_PEER_ADDR_CHANGE event with - SCTP_ADDR_PF state will be sent and a SCTP_PF-state transport info - can be got via SCTP_GET_PEER_ADDR_INFO sockopt; When it's enabled, - a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming - SCTP_PF state and a SCTP_PF-state transport info can be got via - SCTP_GET_PEER_ADDR_INFO sockopt; When it's diabled, no - SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when - trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO - sockopt. - - 0: Unset pf state exposure, Compatible with old applications. - - 1: Disable pf state exposure. - - 2: Enable pf state exposure. - - Default: 0 - -addip_noauth_enable - BOOLEAN - Dynamic Address Reconfiguration (ADD-IP) requires the use of - authentication to protect the operations of adding or removing new - addresses. This requirement is mandated so that unauthorized hosts - would not be able to hijack associations. However, older - implementations may not have implemented this requirement while - allowing the ADD-IP extension. For reasons of interoperability, - we provide this variable to control the enforcement of the - authentication requirement. - - 1: Allow ADD-IP extension to be used without authentication. This - should only be set in a closed environment for interoperability - with older implementations. - - 0: Enforce the authentication requirement - - Default: 0 - -auth_enable - BOOLEAN - Enable or disable Authenticated Chunks extension. This extension - provides the ability to send and receive authenticated chunks and is - required for secure operation of Dynamic Address Reconfiguration - (ADD-IP) extension. - - 1: Enable this extension. - 0: Disable this extension. - - Default: 0 - -prsctp_enable - BOOLEAN - Enable or disable the Partial Reliability extension (RFC3758) which - is used to notify peers that a given DATA should no longer be expected. - - 1: Enable extension - 0: Disable - - Default: 1 - -max_burst - INTEGER - The limit of the number of new packets that can be initially sent. It - controls how bursty the generated traffic can be. - - Default: 4 - -association_max_retrans - INTEGER - Set the maximum number for retransmissions that an association can - attempt deciding that the remote end is unreachable. If this value - is exceeded, the association is terminated. - - Default: 10 - -max_init_retransmits - INTEGER - The maximum number of retransmissions of INIT and COOKIE-ECHO chunks - that an association will attempt before declaring the destination - unreachable and terminating. - - Default: 8 - -path_max_retrans - INTEGER - The maximum number of retransmissions that will be attempted on a given - path. Once this threshold is exceeded, the path is considered - unreachable, and new traffic will use a different path when the - association is multihomed. - - Default: 5 - -pf_retrans - INTEGER - The number of retransmissions that will be attempted on a given path - before traffic is redirected to an alternate transport (should one - exist). Note this is distinct from path_max_retrans, as a path that - passes the pf_retrans threshold can still be used. Its only - deprioritized when a transmission path is selected by the stack. This - setting is primarily used to enable fast failover mechanisms without - having to reduce path_max_retrans to a very low value. See: - http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt - for details. Note also that a value of pf_retrans > path_max_retrans - disables this feature. Since both pf_retrans and path_max_retrans can - be changed by userspace application, a variable pf_enable is used to - disable pf state. - - Default: 0 - -ps_retrans - INTEGER - Primary.Switchover.Max.Retrans (PSMR), it's a tunable parameter coming - from section-5 "Primary Path Switchover" in rfc7829. The primary path - will be changed to another active path when the path error counter on - the old primary path exceeds PSMR, so that "the SCTP sender is allowed - to continue data transmission on a new working path even when the old - primary destination address becomes active again". Note this feature - is disabled by initializing 'ps_retrans' per netns as 0xffff by default, - and its value can't be less than 'pf_retrans' when changing by sysctl. - - Default: 0xffff - -rto_initial - INTEGER - The initial round trip timeout value in milliseconds that will be used - in calculating round trip times. This is the initial time interval - for retransmissions. - - Default: 3000 - -rto_max - INTEGER - The maximum value (in milliseconds) of the round trip timeout. This - is the largest time interval that can elapse between retransmissions. - - Default: 60000 - -rto_min - INTEGER - The minimum value (in milliseconds) of the round trip timeout. This - is the smallest time interval the can elapse between retransmissions. - - Default: 1000 - -hb_interval - INTEGER - The interval (in milliseconds) between HEARTBEAT chunks. These chunks - are sent at the specified interval on idle paths to probe the state of - a given path between 2 associations. - - Default: 30000 - -sack_timeout - INTEGER - The amount of time (in milliseconds) that the implementation will wait - to send a SACK. - - Default: 200 - -valid_cookie_life - INTEGER - The default lifetime of the SCTP cookie (in milliseconds). The cookie - is used during association establishment. - - Default: 60000 - -cookie_preserve_enable - BOOLEAN - Enable or disable the ability to extend the lifetime of the SCTP cookie - that is used during the establishment phase of SCTP association - - 1: Enable cookie lifetime extension. - 0: Disable - - Default: 1 - -cookie_hmac_alg - STRING - Select the hmac algorithm used when generating the cookie value sent by - a listening sctp socket to a connecting client in the INIT-ACK chunk. - Valid values are: - * md5 - * sha1 - * none - Ability to assign md5 or sha1 as the selected alg is predicated on the - configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and - CONFIG_CRYPTO_SHA1). - - Default: Dependent on configuration. MD5 if available, else SHA1 if - available, else none. - -rcvbuf_policy - INTEGER - Determines if the receive buffer is attributed to the socket or to - association. SCTP supports the capability to create multiple - associations on a single socket. When using this capability, it is - possible that a single stalled association that's buffering a lot - of data may block other associations from delivering their data by - consuming all of the receive buffer space. To work around this, - the rcvbuf_policy could be set to attribute the receiver buffer space - to each association instead of the socket. This prevents the described - blocking. - - 1: rcvbuf space is per association - 0: rcvbuf space is per socket - - Default: 0 - -sndbuf_policy - INTEGER - Similar to rcvbuf_policy above, this applies to send buffer space. - - 1: Send buffer is tracked per association - 0: Send buffer is tracked per socket. - - Default: 0 - -sctp_mem - vector of 3 INTEGERs: min, pressure, max - Number of pages allowed for queueing by all SCTP sockets. - - min: Below this number of pages SCTP is not bothered about its - memory appetite. When amount of memory allocated by SCTP exceeds - this number, SCTP starts to moderate memory usage. - - pressure: This value was introduced to follow format of tcp_mem. - - max: Number of pages allowed for queueing by all SCTP sockets. - - Default is calculated at boot time from amount of available memory. - -sctp_rmem - vector of 3 INTEGERs: min, default, max - Only the first value ("min") is used, "default" and "max" are - ignored. - - min: Minimal size of receive buffer used by SCTP socket. - It is guaranteed to each SCTP socket (but not association) even - under moderate memory pressure. - - Default: 4K - -sctp_wmem - vector of 3 INTEGERs: min, default, max - Currently this tunable has no effect. - -addr_scope_policy - INTEGER - Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 - - 0 - Disable IPv4 address scoping - 1 - Enable IPv4 address scoping - 2 - Follow draft but allow IPv4 private addresses - 3 - Follow draft but allow IPv4 link local addresses - - Default: 1 - - -/proc/sys/net/core/* - Please see: Documentation/admin-guide/sysctl/net.rst for descriptions of these entries. - - -/proc/sys/net/unix/* -max_dgram_qlen - INTEGER - The maximum length of dgram socket receive queue - - Default: 10 - diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst index 10e11099e74a..4edd0d38779e 100644 --- a/Documentation/networking/snmp_counter.rst +++ b/Documentation/networking/snmp_counter.rst @@ -792,7 +792,7 @@ counters to indicate the ACK is skipped in which scenario. The ACK would only be skipped if the received packet is either a SYN packet or it has no data. -.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt +.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.rst * TcpExtTCPACKSkippedSynRecv diff --git a/net/Kconfig b/net/Kconfig index df8d8c9bd021..8b1f85820a6b 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -86,7 +86,7 @@ config INET "Sysctl support" below, you can change various aspects of the behavior of the TCP/IP code by writing to the (virtual) files in /proc/sys/net/ipv4/*; the options are explained in the file - . + . Short answer: say Y. diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index 25a8888826b8..5da4733067fb 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -49,7 +49,7 @@ config IP_ADVANCED_ROUTER Note that some distributions enable it in startup scripts. For details about rp_filter strict and loose mode read - . + . If unsure, say N here. diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index fc61f51d87a3..956a806649f7 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -853,7 +853,7 @@ static bool icmp_unreach(struct sk_buff *skb) case ICMP_FRAG_NEEDED: /* for documentation of the ip_no_pmtu_disc * values please see - * Documentation/networking/ip-sysctl.txt + * Documentation/networking/ip-sysctl.rst */ switch (net->ipv4.sysctl_ip_no_pmtu_disc) { default: -- cgit v1.2.3 From 19093313cb0486d568232934bb80dd422d891623 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:50 +0200 Subject: docs: networking: convert ipv6.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - mark a literal as such, in order to avoid a warning; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/admin-guide/kernel-parameters.txt | 6 +- Documentation/networking/index.rst | 1 + Documentation/networking/ipv6.rst | 78 +++++++++++++++++++++++++ Documentation/networking/ipv6.txt | 72 ----------------------- net/ipv6/Kconfig | 2 +- 5 files changed, 83 insertions(+), 76 deletions(-) create mode 100644 Documentation/networking/ipv6.rst delete mode 100644 Documentation/networking/ipv6.txt (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index e37db6f1be64..e43f2e1f2958 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -356,7 +356,7 @@ shot down by NMI autoconf= [IPV6] - See Documentation/networking/ipv6.txt. + See Documentation/networking/ipv6.rst. show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller Limit apic dumping. The parameter defines the maximal @@ -872,7 +872,7 @@ miss to occur. disable= [IPV6] - See Documentation/networking/ipv6.txt. + See Documentation/networking/ipv6.rst. hardened_usercopy= [KNL] Under CONFIG_HARDENED_USERCOPY, whether @@ -912,7 +912,7 @@ to workaround buggy firmware. disable_ipv6= [IPV6] - See Documentation/networking/ipv6.txt. + See Documentation/networking/ipv6.rst. disable_mtrr_cleanup [X86] The kernel tries to adjust MTRR layout from continuous diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 7d133d8dbe2a..709675464e51 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -70,6 +70,7 @@ Contents: iphase ipsec ip-sysctl + ipv6 .. only:: subproject and html diff --git a/Documentation/networking/ipv6.rst b/Documentation/networking/ipv6.rst new file mode 100644 index 000000000000..ba09c2f2dcc7 --- /dev/null +++ b/Documentation/networking/ipv6.rst @@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==== +IPv6 +==== + + +Options for the ipv6 module are supplied as parameters at load time. + +Module options may be given as command line arguments to the insmod +or modprobe command, but are usually specified in either +``/etc/modules.d/*.conf`` configuration files, or in a distro-specific +configuration file. + +The available ipv6 module parameters are listed below. If a parameter +is not specified the default value is used. + +The parameters are as follows: + +disable + + Specifies whether to load the IPv6 module, but disable all + its functionality. This might be used when another module + has a dependency on the IPv6 module being loaded, but no + IPv6 addresses or operations are desired. + + The possible values and their effects are: + + 0 + IPv6 is enabled. + + This is the default value. + + 1 + IPv6 is disabled. + + No IPv6 addresses will be added to interfaces, and + it will not be possible to open an IPv6 socket. + + A reboot is required to enable IPv6. + +autoconf + + Specifies whether to enable IPv6 address autoconfiguration + on all interfaces. This might be used when one does not wish + for addresses to be automatically generated from prefixes + received in Router Advertisements. + + The possible values and their effects are: + + 0 + IPv6 address autoconfiguration is disabled on all interfaces. + + Only the IPv6 loopback address (::1) and link-local addresses + will be added to interfaces. + + 1 + IPv6 address autoconfiguration is enabled on all interfaces. + + This is the default value. + +disable_ipv6 + + Specifies whether to disable IPv6 on all interfaces. + This might be used when no IPv6 addresses are desired. + + The possible values and their effects are: + + 0 + IPv6 is enabled on all interfaces. + + This is the default value. + + 1 + IPv6 is disabled on all interfaces. + + No IPv6 addresses will be added to interfaces. + diff --git a/Documentation/networking/ipv6.txt b/Documentation/networking/ipv6.txt deleted file mode 100644 index 6cd74fa55358..000000000000 --- a/Documentation/networking/ipv6.txt +++ /dev/null @@ -1,72 +0,0 @@ - -Options for the ipv6 module are supplied as parameters at load time. - -Module options may be given as command line arguments to the insmod -or modprobe command, but are usually specified in either -/etc/modules.d/*.conf configuration files, or in a distro-specific -configuration file. - -The available ipv6 module parameters are listed below. If a parameter -is not specified the default value is used. - -The parameters are as follows: - -disable - - Specifies whether to load the IPv6 module, but disable all - its functionality. This might be used when another module - has a dependency on the IPv6 module being loaded, but no - IPv6 addresses or operations are desired. - - The possible values and their effects are: - - 0 - IPv6 is enabled. - - This is the default value. - - 1 - IPv6 is disabled. - - No IPv6 addresses will be added to interfaces, and - it will not be possible to open an IPv6 socket. - - A reboot is required to enable IPv6. - -autoconf - - Specifies whether to enable IPv6 address autoconfiguration - on all interfaces. This might be used when one does not wish - for addresses to be automatically generated from prefixes - received in Router Advertisements. - - The possible values and their effects are: - - 0 - IPv6 address autoconfiguration is disabled on all interfaces. - - Only the IPv6 loopback address (::1) and link-local addresses - will be added to interfaces. - - 1 - IPv6 address autoconfiguration is enabled on all interfaces. - - This is the default value. - -disable_ipv6 - - Specifies whether to disable IPv6 on all interfaces. - This might be used when no IPv6 addresses are desired. - - The possible values and their effects are: - - 0 - IPv6 is enabled on all interfaces. - - This is the default value. - - 1 - IPv6 is disabled on all interfaces. - - No IPv6 addresses will be added to interfaces. - diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig index 2ccaee98fddb..5a6111da26c4 100644 --- a/net/ipv6/Kconfig +++ b/net/ipv6/Kconfig @@ -13,7 +13,7 @@ menuconfig IPV6 For general information about IPv6, see . For specific information about IPv6 under Linux, see - Documentation/networking/ipv6.txt and read the HOWTO at + Documentation/networking/ipv6.rst and read the HOWTO at To compile this protocol support as a module, choose M here: the -- cgit v1.2.3 From 1dc2a785954bf4e562d0c85bea435ee56f705db5 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:51 +0200 Subject: docs: networking: convert ipvlan.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ipvlan.rst | 189 ++++++++++++++++++++++++++++++++++++ Documentation/networking/ipvlan.txt | 146 ---------------------------- 3 files changed, 190 insertions(+), 146 deletions(-) create mode 100644 Documentation/networking/ipvlan.rst delete mode 100644 Documentation/networking/ipvlan.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 709675464e51..54dee1575b54 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -71,6 +71,7 @@ Contents: ipsec ip-sysctl ipv6 + ipvlan .. only:: subproject and html diff --git a/Documentation/networking/ipvlan.rst b/Documentation/networking/ipvlan.rst new file mode 100644 index 000000000000..694adcba36b0 --- /dev/null +++ b/Documentation/networking/ipvlan.rst @@ -0,0 +1,189 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +IPVLAN Driver HOWTO +=================== + +Initial Release: + Mahesh Bandewar + +1. Introduction: +================ +This is conceptually very similar to the macvlan driver with one major +exception of using L3 for mux-ing /demux-ing among slaves. This property makes +the master device share the L2 with it's slave devices. I have developed this +driver in conjunction with network namespaces and not sure if there is use case +outside of it. + + +2. Building and Installation: +============================= + +In order to build the driver, please select the config item CONFIG_IPVLAN. +The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module +(CONFIG_IPVLAN=m). + + +3. Configuration: +================= + +There are no module parameters for this driver and it can be configured +using IProute2/ip utility. +:: + + ip link add link name type ipvlan [ mode MODE ] [ FLAGS ] + where + MODE: l3 (default) | l3s | l2 + FLAGS: bridge (default) | private | vepa + +e.g. + + (a) Following will create IPvlan link with eth0 as master in + L3 bridge mode:: + + bash# ip link add link eth0 name ipvl0 type ipvlan + (b) This command will create IPvlan link in L2 bridge mode:: + + bash# ip link add link eth0 name ipvl0 type ipvlan mode l2 bridge + + (c) This command will create an IPvlan device in L2 private mode:: + + bash# ip link add link eth0 name ipvlan type ipvlan mode l2 private + + (d) This command will create an IPvlan device in L2 vepa mode:: + + bash# ip link add link eth0 name ipvlan type ipvlan mode l2 vepa + + +4. Operating modes: +=================== + +IPvlan has two modes of operation - L2 and L3. For a given master device, +you can select one of these two modes and all slaves on that master will +operate in the same (selected) mode. The RX mode is almost identical except +that in L3 mode the slaves wont receive any multicast / broadcast traffic. +L3 mode is more restrictive since routing is controlled from the other (mostly) +default namespace. + +4.1 L2 mode: +------------ + +In this mode TX processing happens on the stack instance attached to the +slave device and packets are switched and queued to the master device to send +out. In this mode the slaves will RX/TX multicast and broadcast (if applicable) +as well. + +4.2 L3 mode: +------------ + +In this mode TX processing up to L3 happens on the stack instance attached +to the slave device and packets are switched to the stack instance of the +master device for the L2 processing and routing from that instance will be +used before packets are queued on the outbound device. In this mode the slaves +will not receive nor can send multicast / broadcast traffic. + +4.3 L3S mode: +------------- + +This is very similar to the L3 mode except that iptables (conn-tracking) +works in this mode and hence it is L3-symmetric (L3s). This will have slightly less +performance but that shouldn't matter since you are choosing this mode over plain-L3 +mode to make conn-tracking work. + +5. Mode flags: +============== + +At this time following mode flags are available + +5.1 bridge: +----------- +This is the default option. To configure the IPvlan port in this mode, +user can choose to either add this option on the command-line or don't specify +anything. This is the traditional mode where slaves can cross-talk among +themselves apart from talking through the master device. + +5.2 private: +------------ +If this option is added to the command-line, the port is set in private +mode. i.e. port won't allow cross communication between slaves. + +5.3 vepa: +--------- +If this is added to the command-line, the port is set in VEPA mode. +i.e. port will offload switching functionality to the external entity as +described in 802.1Qbg +Note: VEPA mode in IPvlan has limitations. IPvlan uses the mac-address of the +master-device, so the packets which are emitted in this mode for the adjacent +neighbor will have source and destination mac same. This will make the switch / +router send the redirect message. + +6. What to choose (macvlan vs. ipvlan)? +======================================= + +These two devices are very similar in many regards and the specific use +case could very well define which device to choose. if one of the following +situations defines your use case then you can choose to use ipvlan: + + +(a) The Linux host that is connected to the external switch / router has + policy configured that allows only one mac per port. +(b) No of virtual devices created on a master exceed the mac capacity and + puts the NIC in promiscuous mode and degraded performance is a concern. +(c) If the slave device is to be put into the hostile / untrusted network + namespace where L2 on the slave could be changed / misused. + + +6. Example configuration: +========================= + +:: + + +=============================================================+ + | Host: host1 | + | | + | +----------------------+ +----------------------+ | + | | NS:ns0 | | NS:ns1 | | + | | | | | | + | | | | | | + | | ipvl0 | | ipvl1 | | + | +----------#-----------+ +-----------#----------+ | + | # # | + | ################################ | + | # eth0 | + +==============================#==============================+ + + +(a) Create two network namespaces - ns0, ns1:: + + ip netns add ns0 + ip netns add ns1 + +(b) Create two ipvlan slaves on eth0 (master device):: + + ip link add link eth0 ipvl0 type ipvlan mode l2 + ip link add link eth0 ipvl1 type ipvlan mode l2 + +(c) Assign slaves to the respective network namespaces:: + + ip link set dev ipvl0 netns ns0 + ip link set dev ipvl1 netns ns1 + +(d) Now switch to the namespace (ns0 or ns1) to configure the slave devices + + - For ns0:: + + (1) ip netns exec ns0 bash + (2) ip link set dev ipvl0 up + (3) ip link set dev lo up + (4) ip -4 addr add 127.0.0.1 dev lo + (5) ip -4 addr add $IPADDR dev ipvl0 + (6) ip -4 route add default via $ROUTER dev ipvl0 + + - For ns1:: + + (1) ip netns exec ns1 bash + (2) ip link set dev ipvl1 up + (3) ip link set dev lo up + (4) ip -4 addr add 127.0.0.1 dev lo + (5) ip -4 addr add $IPADDR dev ipvl1 + (6) ip -4 route add default via $ROUTER dev ipvl1 diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt deleted file mode 100644 index 27a38e50c287..000000000000 --- a/Documentation/networking/ipvlan.txt +++ /dev/null @@ -1,146 +0,0 @@ - - IPVLAN Driver HOWTO - -Initial Release: - Mahesh Bandewar - -1. Introduction: - This is conceptually very similar to the macvlan driver with one major -exception of using L3 for mux-ing /demux-ing among slaves. This property makes -the master device share the L2 with it's slave devices. I have developed this -driver in conjunction with network namespaces and not sure if there is use case -outside of it. - - -2. Building and Installation: - In order to build the driver, please select the config item CONFIG_IPVLAN. -The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module -(CONFIG_IPVLAN=m). - - -3. Configuration: - There are no module parameters for this driver and it can be configured -using IProute2/ip utility. - - ip link add link name type ipvlan [ mode MODE ] [ FLAGS ] - where - MODE: l3 (default) | l3s | l2 - FLAGS: bridge (default) | private | vepa - - e.g. - (a) Following will create IPvlan link with eth0 as master in - L3 bridge mode - bash# ip link add link eth0 name ipvl0 type ipvlan - (b) This command will create IPvlan link in L2 bridge mode. - bash# ip link add link eth0 name ipvl0 type ipvlan mode l2 bridge - (c) This command will create an IPvlan device in L2 private mode. - bash# ip link add link eth0 name ipvlan type ipvlan mode l2 private - (d) This command will create an IPvlan device in L2 vepa mode. - bash# ip link add link eth0 name ipvlan type ipvlan mode l2 vepa - - -4. Operating modes: - IPvlan has two modes of operation - L2 and L3. For a given master device, -you can select one of these two modes and all slaves on that master will -operate in the same (selected) mode. The RX mode is almost identical except -that in L3 mode the slaves wont receive any multicast / broadcast traffic. -L3 mode is more restrictive since routing is controlled from the other (mostly) -default namespace. - -4.1 L2 mode: - In this mode TX processing happens on the stack instance attached to the -slave device and packets are switched and queued to the master device to send -out. In this mode the slaves will RX/TX multicast and broadcast (if applicable) -as well. - -4.2 L3 mode: - In this mode TX processing up to L3 happens on the stack instance attached -to the slave device and packets are switched to the stack instance of the -master device for the L2 processing and routing from that instance will be -used before packets are queued on the outbound device. In this mode the slaves -will not receive nor can send multicast / broadcast traffic. - -4.3 L3S mode: - This is very similar to the L3 mode except that iptables (conn-tracking) -works in this mode and hence it is L3-symmetric (L3s). This will have slightly less -performance but that shouldn't matter since you are choosing this mode over plain-L3 -mode to make conn-tracking work. - -5. Mode flags: - At this time following mode flags are available - -5.1 bridge: - This is the default option. To configure the IPvlan port in this mode, -user can choose to either add this option on the command-line or don't specify -anything. This is the traditional mode where slaves can cross-talk among -themselves apart from talking through the master device. - -5.2 private: - If this option is added to the command-line, the port is set in private -mode. i.e. port won't allow cross communication between slaves. - -5.3 vepa: - If this is added to the command-line, the port is set in VEPA mode. -i.e. port will offload switching functionality to the external entity as -described in 802.1Qbg -Note: VEPA mode in IPvlan has limitations. IPvlan uses the mac-address of the -master-device, so the packets which are emitted in this mode for the adjacent -neighbor will have source and destination mac same. This will make the switch / -router send the redirect message. - -6. What to choose (macvlan vs. ipvlan)? - These two devices are very similar in many regards and the specific use -case could very well define which device to choose. if one of the following -situations defines your use case then you can choose to use ipvlan - - (a) The Linux host that is connected to the external switch / router has -policy configured that allows only one mac per port. - (b) No of virtual devices created on a master exceed the mac capacity and -puts the NIC in promiscuous mode and degraded performance is a concern. - (c) If the slave device is to be put into the hostile / untrusted network -namespace where L2 on the slave could be changed / misused. - - -6. Example configuration: - - +=============================================================+ - | Host: host1 | - | | - | +----------------------+ +----------------------+ | - | | NS:ns0 | | NS:ns1 | | - | | | | | | - | | | | | | - | | ipvl0 | | ipvl1 | | - | +----------#-----------+ +-----------#----------+ | - | # # | - | ################################ | - | # eth0 | - +==============================#==============================+ - - - (a) Create two network namespaces - ns0, ns1 - ip netns add ns0 - ip netns add ns1 - - (b) Create two ipvlan slaves on eth0 (master device) - ip link add link eth0 ipvl0 type ipvlan mode l2 - ip link add link eth0 ipvl1 type ipvlan mode l2 - - (c) Assign slaves to the respective network namespaces - ip link set dev ipvl0 netns ns0 - ip link set dev ipvl1 netns ns1 - - (d) Now switch to the namespace (ns0 or ns1) to configure the slave devices - - For ns0 - (1) ip netns exec ns0 bash - (2) ip link set dev ipvl0 up - (3) ip link set dev lo up - (4) ip -4 addr add 127.0.0.1 dev lo - (5) ip -4 addr add $IPADDR dev ipvl0 - (6) ip -4 route add default via $ROUTER dev ipvl0 - - For ns1 - (1) ip netns exec ns1 bash - (2) ip link set dev ipvl1 up - (3) ip link set dev lo up - (4) ip -4 addr add 127.0.0.1 dev lo - (5) ip -4 addr add $IPADDR dev ipvl1 - (6) ip -4 route add default via $ROUTER dev ipvl1 -- cgit v1.2.3 From 82a07bf33d7d0c3a194f62178e0fea2d68227b89 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:52 +0200 Subject: docs: networking: convert ipvs-sysctl.txt to ReST - add SPDX header; - add a document title; - mark lists as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Simon Horman Signed-off-by: David S. Miller --- Documentation/admin-guide/sysctl/net.rst | 4 +- Documentation/networking/index.rst | 1 + Documentation/networking/ipvs-sysctl.rst | 302 +++++++++++++++++++++++++++++++ Documentation/networking/ipvs-sysctl.txt | 294 ------------------------------ MAINTAINERS | 2 +- 5 files changed, 306 insertions(+), 297 deletions(-) create mode 100644 Documentation/networking/ipvs-sysctl.rst delete mode 100644 Documentation/networking/ipvs-sysctl.txt (limited to 'Documentation') diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 84e3348a9543..2ad1b77a7182 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -353,8 +353,8 @@ socket's buffer. It will not take effect unless PF_UNIX flag is specified. 3. /proc/sys/net/ipv4 - IPV4 settings ------------------------------------- -Please see: Documentation/networking/ip-sysctl.rst and ipvs-sysctl.txt for -descriptions of these entries. +Please see: Documentation/networking/ip-sysctl.rst and +Documentation/admin-guide/sysctl/net.rst for descriptions of these entries. 4. Appletalk diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 54dee1575b54..bbd4e0041457 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -72,6 +72,7 @@ Contents: ip-sysctl ipv6 ipvlan + ipvs-sysctl .. only:: subproject and html diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst new file mode 100644 index 000000000000..be36c4600e8f --- /dev/null +++ b/Documentation/networking/ipvs-sysctl.rst @@ -0,0 +1,302 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========== +IPvs-sysctl +=========== + +/proc/sys/net/ipv4/vs/* Variables: +================================== + +am_droprate - INTEGER + default 10 + + It sets the always mode drop rate, which is used in the mode 3 + of the drop_rate defense. + +amemthresh - INTEGER + default 1024 + + It sets the available memory threshold (in pages), which is + used in the automatic modes of defense. When there is no + enough available memory, the respective strategy will be + enabled and the variable is automatically set to 2, otherwise + the strategy is disabled and the variable is set to 1. + +backup_only - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If set, disable the director function while the server is + in backup mode to avoid packet loops for DR/TUN methods. + +conn_reuse_mode - INTEGER + 1 - default + + Controls how ipvs will deal with connections that are detected + port reuse. It is a bitmap, with the values being: + + 0: disable any special handling on port reuse. The new + connection will be delivered to the same real server that was + servicing the previous connection. This will effectively + disable expire_nodest_conn. + + bit 1: enable rescheduling of new connections when it is safe. + That is, whenever expire_nodest_conn and for TCP sockets, when + the connection is in TIME_WAIT state (which is only possible if + you use NAT mode). + + bit 2: it is bit 1 plus, for TCP connections, when connections + are in FIN_WAIT state, as this is the last state seen by load + balancer in Direct Routing mode. This bit helps on adding new + real servers to a very busy cluster. + +conntrack - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If set, maintain connection tracking entries for + connections handled by IPVS. + + This should be enabled if connections handled by IPVS are to be + also handled by stateful firewall rules. That is, iptables rules + that make use of connection tracking. It is a performance + optimisation to disable this setting otherwise. + + Connections handled by the IPVS FTP application module + will have connection tracking entries regardless of this setting. + + Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled. + +cache_bypass - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If it is enabled, forward packets to the original destination + directly when no cache server is available and destination + address is not local (iph->daddr is RTN_UNICAST). It is mostly + used in transparent web cache cluster. + +debug_level - INTEGER + - 0 - transmission error messages (default) + - 1 - non-fatal error messages + - 2 - configuration + - 3 - destination trash + - 4 - drop entry + - 5 - service lookup + - 6 - scheduling + - 7 - connection new/expire, lookup and synchronization + - 8 - state transition + - 9 - binding destination, template checks and applications + - 10 - IPVS packet transmission + - 11 - IPVS packet handling (ip_vs_in/ip_vs_out) + - 12 or more - packet traversal + + Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled. + + Higher debugging levels include the messages for lower debugging + levels, so setting debug level 2, includes level 0, 1 and 2 + messages. Thus, logging becomes more and more verbose the higher + the level. + +drop_entry - INTEGER + - 0 - disabled (default) + + The drop_entry defense is to randomly drop entries in the + connection hash table, just in order to collect back some + memory for new connections. In the current code, the + drop_entry procedure can be activated every second, then it + randomly scans 1/32 of the whole and drops entries that are in + the SYN-RECV/SYNACK state, which should be effective against + syn-flooding attack. + + The valid values of drop_entry are from 0 to 3, where 0 means + that this strategy is always disabled, 1 and 2 mean automatic + modes (when there is no enough available memory, the strategy + is enabled and the variable is automatically set to 2, + otherwise the strategy is disabled and the variable is set to + 1), and 3 means that that the strategy is always enabled. + +drop_packet - INTEGER + - 0 - disabled (default) + + The drop_packet defense is designed to drop 1/rate packets + before forwarding them to real servers. If the rate is 1, then + drop all the incoming packets. + + The value definition is the same as that of the drop_entry. In + the automatic mode, the rate is determined by the follow + formula: rate = amemthresh / (amemthresh - available_memory) + when available memory is less than the available memory + threshold. When the mode 3 is set, the always mode drop rate + is controlled by the /proc/sys/net/ipv4/vs/am_droprate. + +expire_nodest_conn - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + The default value is 0, the load balancer will silently drop + packets when its destination server is not available. It may + be useful, when user-space monitoring program deletes the + destination server (because of server overload or wrong + detection) and add back the server later, and the connections + to the server can continue. + + If this feature is enabled, the load balancer will expire the + connection immediately when a packet arrives and its + destination server is not available, then the client program + will be notified that the connection is closed. This is + equivalent to the feature some people requires to flush + connections when its destination is not available. + +expire_quiescent_template - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + When set to a non-zero value, the load balancer will expire + persistent templates when the destination server is quiescent. + This may be useful, when a user makes a destination server + quiescent by setting its weight to 0 and it is desired that + subsequent otherwise persistent connections are sent to a + different destination server. By default new persistent + connections are allowed to quiescent destination servers. + + If this feature is enabled, the load balancer will expire the + persistence template if it is to be used to schedule a new + connection and the destination server is quiescent. + +ignore_tunneled - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + If set, ipvs will set the ipvs_property on all packets which are of + unrecognized protocols. This prevents us from routing tunneled + protocols like ipip, which is useful to prevent rescheduling + packets that have been tunneled to the ipvs host (i.e. to prevent + ipvs routing loops when ipvs is also acting as a real server). + +nat_icmp_send - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + It controls sending icmp error messages (ICMP_DEST_UNREACH) + for VS/NAT when the load balancer receives packets from real + servers but the connection entries don't exist. + +pmtu_disc - BOOLEAN + - 0 - disabled + - not 0 - enabled (default) + + By default, reject with FRAG_NEEDED all DF packets that exceed + the PMTU, irrespective of the forwarding method. For TUN method + the flag can be disabled to fragment such packets. + +secure_tcp - INTEGER + - 0 - disabled (default) + + The secure_tcp defense is to use a more complicated TCP state + transition table. For VS/NAT, it also delays entering the + TCP ESTABLISHED state until the three way handshake is completed. + + The value definition is the same as that of drop_entry and + drop_packet. + +sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period + default 3 50 + + It sets synchronization threshold, which is the minimum number + of incoming packets that a connection needs to receive before + the connection will be synchronized. A connection will be + synchronized, every time the number of its incoming packets + modulus sync_period equals the threshold. The range of the + threshold is from 0 to sync_period. + + When sync_period and sync_refresh_period are 0, send sync only + for state changes or only once when pkts matches sync_threshold + +sync_refresh_period - UNSIGNED INTEGER + default 0 + + In seconds, difference in reported connection timer that triggers + new sync message. It can be used to avoid sync messages for the + specified period (or half of the connection timeout if it is lower) + if connection state is not changed since last sync. + + This is useful for normal connections with high traffic to reduce + sync rate. Additionally, retry sync_retries times with period of + sync_refresh_period/8. + +sync_retries - INTEGER + default 0 + + Defines sync retries with period of sync_refresh_period/8. Useful + to protect against loss of sync messages. The range of the + sync_retries is from 0 to 3. + +sync_qlen_max - UNSIGNED LONG + + Hard limit for queued sync messages that are not sent yet. It + defaults to 1/32 of the memory pages but actually represents + number of messages. It will protect us from allocating large + parts of memory when the sending rate is lower than the queuing + rate. + +sync_sock_size - INTEGER + default 0 + + Configuration of SNDBUF (master) or RCVBUF (slave) socket limit. + Default value is 0 (preserve system defaults). + +sync_ports - INTEGER + default 1 + + The number of threads that master and backup servers can use for + sync traffic. Every thread will use single UDP port, thread 0 will + use the default port 8848 while last thread will use port + 8848+sync_ports-1. + +snat_reroute - BOOLEAN + - 0 - disabled + - not 0 - enabled (default) + + If enabled, recalculate the route of SNATed packets from + realservers so that they are routed as if they originate from the + director. Otherwise they are routed as if they are forwarded by the + director. + + If policy routing is in effect then it is possible that the route + of a packet originating from a director is routed differently to a + packet being forwarded by the director. + + If policy routing is not in effect then the recalculated route will + always be the same as the original route so it is an optimisation + to disable snat_reroute and avoid the recalculation. + +sync_persist_mode - INTEGER + default 0 + + Controls the synchronisation of connections when using persistence + + 0: All types of connections are synchronised + + 1: Attempt to reduce the synchronisation traffic depending on + the connection type. For persistent services avoid synchronisation + for normal connections, do it only for persistence templates. + In such case, for TCP and SCTP it may need enabling sloppy_tcp and + sloppy_sctp flags on backup servers. For non-persistent services + such optimization is not applied, mode 0 is assumed. + +sync_version - INTEGER + default 1 + + The version of the synchronisation protocol used when sending + synchronisation messages. + + 0 selects the original synchronisation protocol (version 0). This + should be used when sending synchronisation messages to a legacy + system that only understands the original synchronisation protocol. + + 1 selects the current synchronisation protocol (version 1). This + should be used where possible. + + Kernels with this sync_version entry are able to receive messages + of both version 1 and version 2 of the synchronisation protocol. diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt deleted file mode 100644 index 056898685d40..000000000000 --- a/Documentation/networking/ipvs-sysctl.txt +++ /dev/null @@ -1,294 +0,0 @@ -/proc/sys/net/ipv4/vs/* Variables: - -am_droprate - INTEGER - default 10 - - It sets the always mode drop rate, which is used in the mode 3 - of the drop_rate defense. - -amemthresh - INTEGER - default 1024 - - It sets the available memory threshold (in pages), which is - used in the automatic modes of defense. When there is no - enough available memory, the respective strategy will be - enabled and the variable is automatically set to 2, otherwise - the strategy is disabled and the variable is set to 1. - -backup_only - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If set, disable the director function while the server is - in backup mode to avoid packet loops for DR/TUN methods. - -conn_reuse_mode - INTEGER - 1 - default - - Controls how ipvs will deal with connections that are detected - port reuse. It is a bitmap, with the values being: - - 0: disable any special handling on port reuse. The new - connection will be delivered to the same real server that was - servicing the previous connection. This will effectively - disable expire_nodest_conn. - - bit 1: enable rescheduling of new connections when it is safe. - That is, whenever expire_nodest_conn and for TCP sockets, when - the connection is in TIME_WAIT state (which is only possible if - you use NAT mode). - - bit 2: it is bit 1 plus, for TCP connections, when connections - are in FIN_WAIT state, as this is the last state seen by load - balancer in Direct Routing mode. This bit helps on adding new - real servers to a very busy cluster. - -conntrack - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If set, maintain connection tracking entries for - connections handled by IPVS. - - This should be enabled if connections handled by IPVS are to be - also handled by stateful firewall rules. That is, iptables rules - that make use of connection tracking. It is a performance - optimisation to disable this setting otherwise. - - Connections handled by the IPVS FTP application module - will have connection tracking entries regardless of this setting. - - Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled. - -cache_bypass - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If it is enabled, forward packets to the original destination - directly when no cache server is available and destination - address is not local (iph->daddr is RTN_UNICAST). It is mostly - used in transparent web cache cluster. - -debug_level - INTEGER - 0 - transmission error messages (default) - 1 - non-fatal error messages - 2 - configuration - 3 - destination trash - 4 - drop entry - 5 - service lookup - 6 - scheduling - 7 - connection new/expire, lookup and synchronization - 8 - state transition - 9 - binding destination, template checks and applications - 10 - IPVS packet transmission - 11 - IPVS packet handling (ip_vs_in/ip_vs_out) - 12 or more - packet traversal - - Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled. - - Higher debugging levels include the messages for lower debugging - levels, so setting debug level 2, includes level 0, 1 and 2 - messages. Thus, logging becomes more and more verbose the higher - the level. - -drop_entry - INTEGER - 0 - disabled (default) - - The drop_entry defense is to randomly drop entries in the - connection hash table, just in order to collect back some - memory for new connections. In the current code, the - drop_entry procedure can be activated every second, then it - randomly scans 1/32 of the whole and drops entries that are in - the SYN-RECV/SYNACK state, which should be effective against - syn-flooding attack. - - The valid values of drop_entry are from 0 to 3, where 0 means - that this strategy is always disabled, 1 and 2 mean automatic - modes (when there is no enough available memory, the strategy - is enabled and the variable is automatically set to 2, - otherwise the strategy is disabled and the variable is set to - 1), and 3 means that that the strategy is always enabled. - -drop_packet - INTEGER - 0 - disabled (default) - - The drop_packet defense is designed to drop 1/rate packets - before forwarding them to real servers. If the rate is 1, then - drop all the incoming packets. - - The value definition is the same as that of the drop_entry. In - the automatic mode, the rate is determined by the follow - formula: rate = amemthresh / (amemthresh - available_memory) - when available memory is less than the available memory - threshold. When the mode 3 is set, the always mode drop rate - is controlled by the /proc/sys/net/ipv4/vs/am_droprate. - -expire_nodest_conn - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - The default value is 0, the load balancer will silently drop - packets when its destination server is not available. It may - be useful, when user-space monitoring program deletes the - destination server (because of server overload or wrong - detection) and add back the server later, and the connections - to the server can continue. - - If this feature is enabled, the load balancer will expire the - connection immediately when a packet arrives and its - destination server is not available, then the client program - will be notified that the connection is closed. This is - equivalent to the feature some people requires to flush - connections when its destination is not available. - -expire_quiescent_template - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - When set to a non-zero value, the load balancer will expire - persistent templates when the destination server is quiescent. - This may be useful, when a user makes a destination server - quiescent by setting its weight to 0 and it is desired that - subsequent otherwise persistent connections are sent to a - different destination server. By default new persistent - connections are allowed to quiescent destination servers. - - If this feature is enabled, the load balancer will expire the - persistence template if it is to be used to schedule a new - connection and the destination server is quiescent. - -ignore_tunneled - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - If set, ipvs will set the ipvs_property on all packets which are of - unrecognized protocols. This prevents us from routing tunneled - protocols like ipip, which is useful to prevent rescheduling - packets that have been tunneled to the ipvs host (i.e. to prevent - ipvs routing loops when ipvs is also acting as a real server). - -nat_icmp_send - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - It controls sending icmp error messages (ICMP_DEST_UNREACH) - for VS/NAT when the load balancer receives packets from real - servers but the connection entries don't exist. - -pmtu_disc - BOOLEAN - 0 - disabled - not 0 - enabled (default) - - By default, reject with FRAG_NEEDED all DF packets that exceed - the PMTU, irrespective of the forwarding method. For TUN method - the flag can be disabled to fragment such packets. - -secure_tcp - INTEGER - 0 - disabled (default) - - The secure_tcp defense is to use a more complicated TCP state - transition table. For VS/NAT, it also delays entering the - TCP ESTABLISHED state until the three way handshake is completed. - - The value definition is the same as that of drop_entry and - drop_packet. - -sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period - default 3 50 - - It sets synchronization threshold, which is the minimum number - of incoming packets that a connection needs to receive before - the connection will be synchronized. A connection will be - synchronized, every time the number of its incoming packets - modulus sync_period equals the threshold. The range of the - threshold is from 0 to sync_period. - - When sync_period and sync_refresh_period are 0, send sync only - for state changes or only once when pkts matches sync_threshold - -sync_refresh_period - UNSIGNED INTEGER - default 0 - - In seconds, difference in reported connection timer that triggers - new sync message. It can be used to avoid sync messages for the - specified period (or half of the connection timeout if it is lower) - if connection state is not changed since last sync. - - This is useful for normal connections with high traffic to reduce - sync rate. Additionally, retry sync_retries times with period of - sync_refresh_period/8. - -sync_retries - INTEGER - default 0 - - Defines sync retries with period of sync_refresh_period/8. Useful - to protect against loss of sync messages. The range of the - sync_retries is from 0 to 3. - -sync_qlen_max - UNSIGNED LONG - - Hard limit for queued sync messages that are not sent yet. It - defaults to 1/32 of the memory pages but actually represents - number of messages. It will protect us from allocating large - parts of memory when the sending rate is lower than the queuing - rate. - -sync_sock_size - INTEGER - default 0 - - Configuration of SNDBUF (master) or RCVBUF (slave) socket limit. - Default value is 0 (preserve system defaults). - -sync_ports - INTEGER - default 1 - - The number of threads that master and backup servers can use for - sync traffic. Every thread will use single UDP port, thread 0 will - use the default port 8848 while last thread will use port - 8848+sync_ports-1. - -snat_reroute - BOOLEAN - 0 - disabled - not 0 - enabled (default) - - If enabled, recalculate the route of SNATed packets from - realservers so that they are routed as if they originate from the - director. Otherwise they are routed as if they are forwarded by the - director. - - If policy routing is in effect then it is possible that the route - of a packet originating from a director is routed differently to a - packet being forwarded by the director. - - If policy routing is not in effect then the recalculated route will - always be the same as the original route so it is an optimisation - to disable snat_reroute and avoid the recalculation. - -sync_persist_mode - INTEGER - default 0 - - Controls the synchronisation of connections when using persistence - - 0: All types of connections are synchronised - 1: Attempt to reduce the synchronisation traffic depending on - the connection type. For persistent services avoid synchronisation - for normal connections, do it only for persistence templates. - In such case, for TCP and SCTP it may need enabling sloppy_tcp and - sloppy_sctp flags on backup servers. For non-persistent services - such optimization is not applied, mode 0 is assumed. - -sync_version - INTEGER - default 1 - - The version of the synchronisation protocol used when sending - synchronisation messages. - - 0 selects the original synchronisation protocol (version 0). This - should be used when sending synchronisation messages to a legacy - system that only understands the original synchronisation protocol. - - 1 selects the current synchronisation protocol (version 1). This - should be used where possible. - - Kernels with this sync_version entry are able to receive messages - of both version 1 and version 2 of the synchronisation protocol. diff --git a/MAINTAINERS b/MAINTAINERS index df5e4ccc1ccb..3a5f52a3c055 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8934,7 +8934,7 @@ L: lvs-devel@vger.kernel.org S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs.git -F: Documentation/networking/ipvs-sysctl.txt +F: Documentation/networking/ipvs-sysctl.rst F: include/net/ip_vs.h F: include/uapi/linux/ip_vs.h F: net/netfilter/ipvs/ -- cgit v1.2.3 From b9dd2bea2245dd8ba4f68e801af93e4b38bfe6b0 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 28 Apr 2020 00:01:53 +0200 Subject: docs: networking: convert kcm.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/kcm.rst | 290 +++++++++++++++++++++++++++++++++++++ Documentation/networking/kcm.txt | 285 ------------------------------------ 3 files changed, 291 insertions(+), 285 deletions(-) create mode 100644 Documentation/networking/kcm.rst delete mode 100644 Documentation/networking/kcm.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index bbd4e0041457..e1ff08b94d90 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -73,6 +73,7 @@ Contents: ipv6 ipvlan ipvs-sysctl + kcm .. only:: subproject and html diff --git a/Documentation/networking/kcm.rst b/Documentation/networking/kcm.rst new file mode 100644 index 000000000000..db0f5560ac1c --- /dev/null +++ b/Documentation/networking/kcm.rst @@ -0,0 +1,290 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================= +Kernel Connection Multiplexor +============================= + +Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based +interface over TCP for generic application protocols. With KCM an application +can efficiently send and receive application protocol messages over TCP using +datagram sockets. + +KCM implements an NxM multiplexor in the kernel as diagrammed below:: + + +------------+ +------------+ +------------+ +------------+ + | KCM socket | | KCM socket | | KCM socket | | KCM socket | + +------------+ +------------+ +------------+ +------------+ + | | | | + +-----------+ | | +----------+ + | | | | + +----------------------------------+ + | Multiplexor | + +----------------------------------+ + | | | | | + +---------+ | | | ------------+ + | | | | | + +----------+ +----------+ +----------+ +----------+ +----------+ + | Psock | | Psock | | Psock | | Psock | | Psock | + +----------+ +----------+ +----------+ +----------+ +----------+ + | | | | | + +----------+ +----------+ +----------+ +----------+ +----------+ + | TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock | + +----------+ +----------+ +----------+ +----------+ +----------+ + +KCM sockets +=========== + +The KCM sockets provide the user interface to the multiplexor. All the KCM sockets +bound to a multiplexor are considered to have equivalent function, and I/O +operations in different sockets may be done in parallel without the need for +synchronization between threads in userspace. + +Multiplexor +=========== + +The multiplexor provides the message steering. In the transmit path, messages +written on a KCM socket are sent atomically on an appropriate TCP socket. +Similarly, in the receive path, messages are constructed on each TCP socket +(Psock) and complete messages are steered to a KCM socket. + +TCP sockets & Psocks +==================== + +TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated +for each bound TCP socket, this structure holds the state for constructing +messages on receive as well as other connection specific information for KCM. + +Connected mode semantics +======================== + +Each multiplexor assumes that all attached TCP connections are to the same +destination and can use the different connections for load balancing when +transmitting. The normal send and recv calls (include sendmmsg and recvmmsg) +can be used to send and receive messages from the KCM socket. + +Socket types +============ + +KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types. + +Message delineation +------------------- + +Messages are sent over a TCP stream with some application protocol message +format that typically includes a header which frames the messages. The length +of a received message can be deduced from the application protocol header +(often just a simple length field). + +A TCP stream must be parsed to determine message boundaries. Berkeley Packet +Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a +BPF program must be specified. The program is called at the start of receiving +a new message and is given an skbuff that contains the bytes received so far. +It parses the message header and returns the length of the message. Given this +information, KCM will construct the message of the stated length and deliver it +to a KCM socket. + +TCP socket management +--------------------- + +When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and +write space available (POLLOUT) events are handled by the multiplexor. If there +is a state change (disconnection) or other error on a TCP socket, an error is +posted on the TCP socket so that a POLLERR event happens and KCM discontinues +using the socket. When the application gets the error notification for a +TCP socket, it should unattach the socket from KCM and then handle the error +condition (the typical response is to close the socket and create a new +connection if necessary). + +KCM limits the maximum receive message size to be the size of the receive +socket buffer on the attached TCP socket (the socket buffer size can be set by +SO_RCVBUF). If the length of a new message reported by the BPF program is +greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP +socket. The BPF program may also enforce a maximum messages size and report an +error when it is exceeded. + +A timeout may be set for assembling messages on a receive socket. The timeout +value is taken from the receive timeout of the attached TCP socket (this is set +by SO_RCVTIMEO). If the timer expires before assembly is complete an error +(ETIMEDOUT) is posted on the socket. + +User interface +============== + +Creating a multiplexor +---------------------- + +A new multiplexor and initial KCM socket is created by a socket call:: + + socket(AF_KCM, type, protocol) + +- type is either SOCK_DGRAM or SOCK_SEQPACKET +- protocol is KCMPROTO_CONNECTED + +Cloning KCM sockets +------------------- + +After the first KCM socket is created using the socket call as described +above, additional sockets for the multiplexor can be created by cloning +a KCM socket. This is accomplished by an ioctl on a KCM socket:: + + /* From linux/kcm.h */ + struct kcm_clone { + int fd; + }; + + struct kcm_clone info; + + memset(&info, 0, sizeof(info)); + + err = ioctl(kcmfd, SIOCKCMCLONE, &info); + + if (!err) + newkcmfd = info.fd; + +Attach transport sockets +------------------------ + +Attaching of transport sockets to a multiplexor is performed by calling an +ioctl on a KCM socket for the multiplexor. e.g.:: + + /* From linux/kcm.h */ + struct kcm_attach { + int fd; + int bpf_fd; + }; + + struct kcm_attach info; + + memset(&info, 0, sizeof(info)); + + info.fd = tcpfd; + info.bpf_fd = bpf_prog_fd; + + ioctl(kcmfd, SIOCKCMATTACH, &info); + +The kcm_attach structure contains: + + - fd: file descriptor for TCP socket being attached + - bpf_prog_fd: file descriptor for compiled BPF program downloaded + +Unattach transport sockets +-------------------------- + +Unattaching a transport socket from a multiplexor is straightforward. An +"unattach" ioctl is done with the kcm_unattach structure as the argument:: + + /* From linux/kcm.h */ + struct kcm_unattach { + int fd; + }; + + struct kcm_unattach info; + + memset(&info, 0, sizeof(info)); + + info.fd = cfd; + + ioctl(fd, SIOCKCMUNATTACH, &info); + +Disabling receive on KCM socket +------------------------------- + +A setsockopt is used to disable or enable receiving on a KCM socket. +When receive is disabled, any pending messages in the socket's +receive buffer are moved to other sockets. This feature is useful +if an application thread knows that it will be doing a lot of +work on a request and won't be able to service new messages for a +while. Example use:: + + int val = 1; + + setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val)) + +BFP programs for message delineation +------------------------------------ + +BPF programs can be compiled using the BPF LLVM backend. For example, +the BPF program for parsing Thrift is:: + + #include "bpf.h" /* for __sk_buff */ + #include "bpf_helpers.h" /* for load_word intrinsic */ + + SEC("socket_kcm") + int bpf_prog1(struct __sk_buff *skb) + { + return load_word(skb, 0) + 4; + } + + char _license[] SEC("license") = "GPL"; + +Use in applications +=================== + +KCM accelerates application layer protocols. Specifically, it allows +applications to use a message based interface for sending and receiving +messages. The kernel provides necessary assurances that messages are sent +and received atomically. This relieves much of the burden applications have +in mapping a message based protocol onto the TCP stream. KCM also make +application layer messages a unit of work in the kernel for the purposes of +steering and scheduling, which in turn allows a simpler networking model in +multithreaded applications. + +Configurations +-------------- + +In an Nx1 configuration, KCM logically provides multiple socket handles +to the same TCP connection. This allows parallelism between in I/O +operations on the TCP socket (for instance copyin and copyout of data is +parallelized). In an application, a KCM socket can be opened for each +processing thread and inserted into the epoll (similar to how SO_REUSEPORT +is used to allow multiple listener sockets on the same port). + +In a MxN configuration, multiple connections are established to the +same destination. These are used for simple load balancing. + +Message batching +---------------- + +The primary purpose of KCM is load balancing between KCM sockets and hence +threads in a nominal use case. Perfect load balancing, that is steering +each received message to a different KCM socket or steering each sent +message to a different TCP socket, can negatively impact performance +since this doesn't allow for affinities to be established. Balancing +based on groups, or batches of messages, can be beneficial for performance. + +On transmit, there are three ways an application can batch (pipeline) +messages on a KCM socket. + + 1) Send multiple messages in a single sendmmsg. + 2) Send a group of messages each with a sendmsg call, where all messages + except the last have MSG_BATCH in the flags of sendmsg call. + 3) Create "super message" composed of multiple messages and send this + with a single sendmsg. + +On receive, the KCM module attempts to queue messages received on the +same KCM socket during each TCP ready callback. The targeted KCM socket +changes at each receive ready callback on the KCM socket. The application +does not need to configure this. + +Error handling +-------------- + +An application should include a thread to monitor errors raised on +the TCP connection. Normally, this will be done by placing each +TCP socket attached to a KCM multiplexor in epoll set for POLLERR +event. If an error occurs on an attached TCP socket, KCM sets an EPIPE +on the socket thus waking up the application thread. When the application +sees the error (which may just be a disconnect) it should unattach the +socket from KCM and then close it. It is assumed that once an error is +posted on the TCP socket the data stream is unrecoverable (i.e. an error +may have occurred in the middle of receiving a message). + +TCP connection monitoring +------------------------- + +In KCM there is no means to correlate a message to the TCP socket that +was used to send or receive the message (except in the case there is +only one attached TCP socket). However, the application does retain +an open file descriptor to the socket so it will be able to get statistics +from the socket which can be used in detecting issues (such as high +retransmissions on the socket). diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt deleted file mode 100644 index b773a5278ac4..000000000000 --- a/Documentation/networking/kcm.txt +++ /dev/null @@ -1,285 +0,0 @@ -Kernel Connection Multiplexor ------------------------------ - -Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based -interface over TCP for generic application protocols. With KCM an application -can efficiently send and receive application protocol messages over TCP using -datagram sockets. - -KCM implements an NxM multiplexor in the kernel as diagrammed below: - -+------------+ +------------+ +------------+ +------------+ -| KCM socket | | KCM socket | | KCM socket | | KCM socket | -+------------+ +------------+ +------------+ +------------+ - | | | | - +-----------+ | | +----------+ - | | | | - +----------------------------------+ - | Multiplexor | - +----------------------------------+ - | | | | | - +---------+ | | | ------------+ - | | | | | -+----------+ +----------+ +----------+ +----------+ +----------+ -| Psock | | Psock | | Psock | | Psock | | Psock | -+----------+ +----------+ +----------+ +----------+ +----------+ - | | | | | -+----------+ +----------+ +----------+ +----------+ +----------+ -| TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock | -+----------+ +----------+ +----------+ +----------+ +----------+ - -KCM sockets ------------ - -The KCM sockets provide the user interface to the multiplexor. All the KCM sockets -bound to a multiplexor are considered to have equivalent function, and I/O -operations in different sockets may be done in parallel without the need for -synchronization between threads in userspace. - -Multiplexor ------------ - -The multiplexor provides the message steering. In the transmit path, messages -written on a KCM socket are sent atomically on an appropriate TCP socket. -Similarly, in the receive path, messages are constructed on each TCP socket -(Psock) and complete messages are steered to a KCM socket. - -TCP sockets & Psocks --------------------- - -TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated -for each bound TCP socket, this structure holds the state for constructing -messages on receive as well as other connection specific information for KCM. - -Connected mode semantics ------------------------- - -Each multiplexor assumes that all attached TCP connections are to the same -destination and can use the different connections for load balancing when -transmitting. The normal send and recv calls (include sendmmsg and recvmmsg) -can be used to send and receive messages from the KCM socket. - -Socket types ------------- - -KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types. - -Message delineation -------------------- - -Messages are sent over a TCP stream with some application protocol message -format that typically includes a header which frames the messages. The length -of a received message can be deduced from the application protocol header -(often just a simple length field). - -A TCP stream must be parsed to determine message boundaries. Berkeley Packet -Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a -BPF program must be specified. The program is called at the start of receiving -a new message and is given an skbuff that contains the bytes received so far. -It parses the message header and returns the length of the message. Given this -information, KCM will construct the message of the stated length and deliver it -to a KCM socket. - -TCP socket management ---------------------- - -When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and -write space available (POLLOUT) events are handled by the multiplexor. If there -is a state change (disconnection) or other error on a TCP socket, an error is -posted on the TCP socket so that a POLLERR event happens and KCM discontinues -using the socket. When the application gets the error notification for a -TCP socket, it should unattach the socket from KCM and then handle the error -condition (the typical response is to close the socket and create a new -connection if necessary). - -KCM limits the maximum receive message size to be the size of the receive -socket buffer on the attached TCP socket (the socket buffer size can be set by -SO_RCVBUF). If the length of a new message reported by the BPF program is -greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP -socket. The BPF program may also enforce a maximum messages size and report an -error when it is exceeded. - -A timeout may be set for assembling messages on a receive socket. The timeout -value is taken from the receive timeout of the attached TCP socket (this is set -by SO_RCVTIMEO). If the timer expires before assembly is complete an error -(ETIMEDOUT) is posted on the socket. - -User interface -============== - -Creating a multiplexor ----------------------- - -A new multiplexor and initial KCM socket is created by a socket call: - - socket(AF_KCM, type, protocol) - - - type is either SOCK_DGRAM or SOCK_SEQPACKET - - protocol is KCMPROTO_CONNECTED - -Cloning KCM sockets -------------------- - -After the first KCM socket is created using the socket call as described -above, additional sockets for the multiplexor can be created by cloning -a KCM socket. This is accomplished by an ioctl on a KCM socket: - - /* From linux/kcm.h */ - struct kcm_clone { - int fd; - }; - - struct kcm_clone info; - - memset(&info, 0, sizeof(info)); - - err = ioctl(kcmfd, SIOCKCMCLONE, &info); - - if (!err) - newkcmfd = info.fd; - -Attach transport sockets ------------------------- - -Attaching of transport sockets to a multiplexor is performed by calling an -ioctl on a KCM socket for the multiplexor. e.g.: - - /* From linux/kcm.h */ - struct kcm_attach { - int fd; - int bpf_fd; - }; - - struct kcm_attach info; - - memset(&info, 0, sizeof(info)); - - info.fd = tcpfd; - info.bpf_fd = bpf_prog_fd; - - ioctl(kcmfd, SIOCKCMATTACH, &info); - -The kcm_attach structure contains: - fd: file descriptor for TCP socket being attached - bpf_prog_fd: file descriptor for compiled BPF program downloaded - -Unattach transport sockets --------------------------- - -Unattaching a transport socket from a multiplexor is straightforward. An -"unattach" ioctl is done with the kcm_unattach structure as the argument: - - /* From linux/kcm.h */ - struct kcm_unattach { - int fd; - }; - - struct kcm_unattach info; - - memset(&info, 0, sizeof(info)); - - info.fd = cfd; - - ioctl(fd, SIOCKCMUNATTACH, &info); - -Disabling receive on KCM socket -------------------------------- - -A setsockopt is used to disable or enable receiving on a KCM socket. -When receive is disabled, any pending messages in the socket's -receive buffer are moved to other sockets. This feature is useful -if an application thread knows that it will be doing a lot of -work on a request and won't be able to service new messages for a -while. Example use: - - int val = 1; - - setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val)) - -BFP programs for message delineation ------------------------------------- - -BPF programs can be compiled using the BPF LLVM backend. For example, -the BPF program for parsing Thrift is: - - #include "bpf.h" /* for __sk_buff */ - #include "bpf_helpers.h" /* for load_word intrinsic */ - - SEC("socket_kcm") - int bpf_prog1(struct __sk_buff *skb) - { - return load_word(skb, 0) + 4; - } - - char _license[] SEC("license") = "GPL"; - -Use in applications -=================== - -KCM accelerates application layer protocols. Specifically, it allows -applications to use a message based interface for sending and receiving -messages. The kernel provides necessary assurances that messages are sent -and received atomically. This relieves much of the burden applications have -in mapping a message based protocol onto the TCP stream. KCM also make -application layer messages a unit of work in the kernel for the purposes of -steering and scheduling, which in turn allows a simpler networking model in -multithreaded applications. - -Configurations --------------- - -In an Nx1 configuration, KCM logically provides multiple socket handles -to the same TCP connection. This allows parallelism between in I/O -operations on the TCP socket (for instance copyin and copyout of data is -parallelized). In an application, a KCM socket can be opened for each -processing thread and inserted into the epoll (similar to how SO_REUSEPORT -is used to allow multiple listener sockets on the same port). - -In a MxN configuration, multiple connections are established to the -same destination. These are used for simple load balancing. - -Message batching ----------------- - -The primary purpose of KCM is load balancing between KCM sockets and hence -threads in a nominal use case. Perfect load balancing, that is steering -each received message to a different KCM socket or steering each sent -message to a different TCP socket, can negatively impact performance -since this doesn't allow for affinities to be established. Balancing -based on groups, or batches of messages, can be beneficial for performance. - -On transmit, there are three ways an application can batch (pipeline) -messages on a KCM socket. - 1) Send multiple messages in a single sendmmsg. - 2) Send a group of messages each with a sendmsg call, where all messages - except the last have MSG_BATCH in the flags of sendmsg call. - 3) Create "super message" composed of multiple messages and send this - with a single sendmsg. - -On receive, the KCM module attempts to queue messages received on the -same KCM socket during each TCP ready callback. The targeted KCM socket -changes at each receive ready callback on the KCM socket. The application -does not need to configure this. - -Error handling --------------- - -An application should include a thread to monitor errors raised on -the TCP connection. Normally, this will be done by placing each -TCP socket attached to a KCM multiplexor in epoll set for POLLERR -event. If an error occurs on an attached TCP socket, KCM sets an EPIPE -on the socket thus waking up the application thread. When the application -sees the error (which may just be a disconnect) it should unattach the -socket from KCM and then close it. It is assumed that once an error is -posted on the TCP socket the data stream is unrecoverable (i.e. an error -may have occurred in the middle of receiving a message). - -TCP connection monitoring -------------------------- - -In KCM there is no means to correlate a message to the TCP socket that -was used to send or receive the message (except in the case there is -only one attached TCP socket). However, the application does retain -an open file descriptor to the socket so it will be able to get statistics -from the socket which can be used in detecting issues (such as high -retransmissions on the socket). -- cgit v1.2.3 From 4972ecee06612e167523f528317920fbeba5f12d Mon Sep 17 00:00:00 2001 From: Robert Marko Date: Thu, 30 Apr 2020 11:07:06 +0200 Subject: dt-bindings: add Qualcomm IPQ4019 MDIO bindings This patch adds the binding document for the IPQ40xx MDIO driver. Signed-off-by: Robert Marko Reviewed-by: Andrew Lunn Reviewed-by: Florian Fainelli Cc: Luka Perkov Signed-off-by: David S. Miller --- .../devicetree/bindings/net/qcom,ipq4019-mdio.yaml | 61 ++++++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml b/Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml new file mode 100644 index 000000000000..13555a89975f --- /dev/null +++ b/Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml @@ -0,0 +1,61 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/qcom,ipq4019-mdio.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Qualcomm IPQ40xx MDIO Controller Device Tree Bindings + +maintainers: + - Robert Marko + +allOf: + - $ref: "mdio.yaml#" + +properties: + compatible: + const: qcom,ipq4019-mdio + + "#address-cells": + const: 1 + + "#size-cells": + const: 0 + + reg: + maxItems: 1 + +required: + - compatible + - reg + - "#address-cells" + - "#size-cells" + +examples: + - | + mdio@90000 { + #address-cells = <1>; + #size-cells = <0>; + compatible = "qcom,ipq4019-mdio"; + reg = <0x90000 0x64>; + + ethphy0: ethernet-phy@0 { + reg = <0>; + }; + + ethphy1: ethernet-phy@1 { + reg = <1>; + }; + + ethphy2: ethernet-phy@2 { + reg = <2>; + }; + + ethphy3: ethernet-phy@3 { + reg = <3>; + }; + + ethphy4: ethernet-phy@4 { + reg = <4>; + }; + }; -- cgit v1.2.3 From 10ebb22137acd97083cb52d8664cdff269e8a629 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:03:56 +0200 Subject: docs: networking: convert l2tp.txt to ReST - add SPDX header; - add a document title; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/l2tp.rst | 358 +++++++++++++++++++++++++++++++++++++ Documentation/networking/l2tp.txt | 345 ----------------------------------- 3 files changed, 359 insertions(+), 345 deletions(-) create mode 100644 Documentation/networking/l2tp.rst delete mode 100644 Documentation/networking/l2tp.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e1ff08b94d90..0c5d7a037983 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -74,6 +74,7 @@ Contents: ipvlan ipvs-sysctl kcm + l2tp .. only:: subproject and html diff --git a/Documentation/networking/l2tp.rst b/Documentation/networking/l2tp.rst new file mode 100644 index 000000000000..a48238a2ec09 --- /dev/null +++ b/Documentation/networking/l2tp.rst @@ -0,0 +1,358 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==== +L2TP +==== + +This document describes how to use the kernel's L2TP drivers to +provide L2TP functionality. L2TP is a protocol that tunnels one or +more sessions over an IP tunnel. It is commonly used for VPNs +(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP +network infrastructure. With L2TPv3, it is also useful as a Layer-2 +tunneling infrastructure. + +Features +======== + +L2TPv2 (PPP over L2TP (UDP tunnels)). +L2TPv3 ethernet pseudowires. +L2TPv3 PPP pseudowires. +L2TPv3 IP encapsulation. +Netlink sockets for L2TPv3 configuration management. + +History +======= + +The original pppol2tp driver was introduced in 2.6.23 and provided +L2TPv2 functionality (rfc2661). L2TPv2 is used to tunnel one or more PPP +sessions over a UDP tunnel. + +L2TPv3 (rfc3931) changes the protocol to allow different frame types +to be passed over an L2TP tunnel by moving the PPP-specific parts of +the protocol out of the core L2TP packet headers. Each frame type is +known as a pseudowire type. Ethernet, PPP, HDLC, Frame Relay and ATM +pseudowires for L2TP are defined in separate RFC standards. Another +change for L2TPv3 is that it can be carried directly over IP with no +UDP header (UDP is optional). It is also possible to create static +unmanaged L2TPv3 tunnels manually without a control protocol +(userspace daemon) to manage them. + +To support L2TPv3, the original pppol2tp driver was split up to +separate the L2TP and PPP functionality. Existing L2TPv2 userspace +apps should be unaffected as the original pppol2tp sockets API is +retained. L2TPv3, however, uses netlink to manage L2TPv3 tunnels and +sessions. + +Design +====== + +The L2TP protocol separates control and data frames. The L2TP kernel +drivers handle only L2TP data frames; control frames are always +handled by userspace. L2TP control frames carry messages between L2TP +clients/servers and are used to setup / teardown tunnels and +sessions. An L2TP client or server is implemented in userspace. + +Each L2TP tunnel is implemented using a UDP or L2TPIP socket; L2TPIP +provides L2TPv3 IP encapsulation (no UDP) and is implemented using a +new l2tpip socket family. The tunnel socket is typically created by +userspace, though for unmanaged L2TPv3 tunnels, the socket can also be +created by the kernel. Each L2TP session (pseudowire) gets a network +interface instance. In the case of PPP, these interfaces are created +indirectly by pppd using a pppol2tp socket. In the case of ethernet, +the netdevice is created upon a netlink request to create an L2TPv3 +ethernet pseudowire. + +For PPP, the PPPoL2TP driver, net/l2tp/l2tp_ppp.c, provides a +mechanism by which PPP frames carried through an L2TP session are +passed through the kernel's PPP subsystem. The standard PPP daemon, +pppd, handles all PPP interaction with the peer. PPP network +interfaces are created for each local PPP endpoint. The kernel's PPP +subsystem arranges for PPP control frames to be delivered to pppd, +while data frames are forwarded as usual. + +For ethernet, the L2TPETH driver, net/l2tp/l2tp_eth.c, implements a +netdevice driver, managing virtual ethernet devices, one per +pseudowire. These interfaces can be managed using standard Linux tools +such as "ip" and "ifconfig". If only IP frames are passed over the +tunnel, the interface can be given an IP addresses of itself and its +peer. If non-IP frames are to be passed over the tunnel, the interface +can be added to a bridge using brctl. All L2TP datapath protocol +functions are handled by the L2TP core driver. + +Each tunnel and session within a tunnel is assigned a unique tunnel_id +and session_id. These ids are carried in the L2TP header of every +control and data packet. (Actually, in L2TPv3, the tunnel_id isn't +present in data frames - it is inferred from the IP connection on +which the packet was received.) The L2TP driver uses the ids to lookup +internal tunnel and/or session contexts to determine how to handle the +packet. Zero tunnel / session ids are treated specially - zero ids are +never assigned to tunnels or sessions in the network. In the driver, +the tunnel context keeps a reference to the tunnel UDP or L2TPIP +socket. The session context holds data that lets the driver interface +to the kernel's network frame type subsystems, i.e. PPP, ethernet. + +Userspace Programming +===================== + +For L2TPv2, there are a number of requirements on the userspace L2TP +daemon in order to use the pppol2tp driver. + +1. Use a UDP socket per tunnel. + +2. Create a single PPPoL2TP socket per tunnel bound to a special null + session id. This is used only for communicating with the driver but + must remain open while the tunnel is active. Opening this tunnel + management socket causes the driver to mark the tunnel socket as an + L2TP UDP encapsulation socket and flags it for use by the + referenced tunnel id. This hooks up the UDP receive path via + udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed + in this special PPPoX socket. + +3. Create a PPPoL2TP socket per L2TP session. This is typically done + by starting pppd with the pppol2tp plugin and appropriate + arguments. A PPPoL2TP tunnel management socket (Step 2) must be + created before the first PPPoL2TP session socket is created. + +When creating PPPoL2TP sockets, the application provides information +to the driver about the socket in a socket connect() call. Source and +destination tunnel and session ids are provided, as well as the file +descriptor of a UDP socket. See struct pppol2tp_addr in +include/linux/if_pppol2tp.h. Note that zero tunnel / session ids are +treated specially. When creating the per-tunnel PPPoL2TP management +socket in Step 2 above, zero source and destination session ids are +specified, which tells the driver to prepare the supplied UDP file +descriptor for use as an L2TP tunnel socket. + +Userspace may control behavior of the tunnel or session using +setsockopt and ioctl on the PPPoX socket. The following socket +options are supported:- + +========= =========================================================== +DEBUG bitmask of debug message categories. See below. +SENDSEQ - 0 => don't send packets with sequence numbers + - 1 => send packets with sequence numbers +RECVSEQ - 0 => receive packet sequence numbers are optional + - 1 => drop receive packets without sequence numbers +LNSMODE - 0 => act as LAC. + - 1 => act as LNS. +REORDERTO reorder timeout (in millisecs). If 0, don't try to reorder. +========= =========================================================== + +Only the DEBUG option is supported by the special tunnel management +PPPoX socket. + +In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided +to retrieve tunnel and session statistics from the kernel using the +PPPoX socket of the appropriate tunnel or session. + +For L2TPv3, userspace must use the netlink API defined in +include/linux/l2tp.h to manage tunnel and session contexts. The +general procedure to create a new L2TP tunnel with one session is:- + +1. Open a GENL socket using L2TP_GENL_NAME for configuring the kernel + using netlink. + +2. Create a UDP or L2TPIP socket for the tunnel. + +3. Create a new L2TP tunnel using a L2TP_CMD_TUNNEL_CREATE + request. Set attributes according to desired tunnel parameters, + referencing the UDP or L2TPIP socket created in the previous step. + +4. Create a new L2TP session in the tunnel using a + L2TP_CMD_SESSION_CREATE request. + +The tunnel and all of its sessions are closed when the tunnel socket +is closed. The netlink API may also be used to delete sessions and +tunnels. Configuration and status info may be set or read using netlink. + +The L2TP driver also supports static (unmanaged) L2TPv3 tunnels. These +are where there is no L2TP control message exchange with the peer to +setup the tunnel; the tunnel is configured manually at each end of the +tunnel. There is no need for an L2TP userspace application in this +case -- the tunnel socket is created by the kernel and configured +using parameters sent in the L2TP_CMD_TUNNEL_CREATE netlink +request. The "ip" utility of iproute2 has commands for managing static +L2TPv3 tunnels; do "ip l2tp help" for more information. + +Debugging +========= + +The driver supports a flexible debug scheme where kernel trace +messages may be optionally enabled per tunnel and per session. Care is +needed when debugging a live system since the messages are not +rate-limited and a busy system could be swamped. Userspace uses +setsockopt on the PPPoX socket to set a debug mask. + +The following debug mask bits are available: + +================ ============================== +L2TP_MSG_DEBUG verbose debug (if compiled in) +L2TP_MSG_CONTROL userspace - kernel interface +L2TP_MSG_SEQ sequence numbers handling +L2TP_MSG_DATA data packets +================ ============================== + +If enabled, files under a l2tp debugfs directory can be used to dump +kernel state about L2TP tunnels and sessions. To access it, the +debugfs filesystem must first be mounted:: + + # mount -t debugfs debugfs /debug + +Files under the l2tp directory can then be accessed:: + + # cat /debug/l2tp/tunnels + +The debugfs files should not be used by applications to obtain L2TP +state information because the file format is subject to change. It is +implemented to provide extra debug information to help diagnose +problems.) Users should use the netlink API. + +/proc/net/pppol2tp is also provided for backwards compatibility with +the original pppol2tp driver. It lists information about L2TPv2 +tunnels and sessions only. Its use is discouraged. + +Unmanaged L2TPv3 Tunnels +======================== + +Some commercial L2TP products support unmanaged L2TPv3 ethernet +tunnels, where there is no L2TP control protocol; tunnels are +configured at each side manually. New commands are available in +iproute2's ip utility to support this. + +To create an L2TPv3 ethernet pseudowire between local host 192.168.1.1 +and peer 192.168.1.2, using IP addresses 10.5.1.1 and 10.5.1.2 for the +tunnel endpoints:: + + # ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \ + udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2 + # ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1 + # ip -s -d show dev l2tpeth0 + # ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0 + # ip li set dev l2tpeth0 up + +Choose IP addresses to be the address of a local IP interface and that +of the remote system. The IP addresses of the l2tpeth0 interface can be +anything suitable. + +Repeat the above at the peer, with ports, tunnel/session ids and IP +addresses reversed. The tunnel and session IDs can be any non-zero +32-bit number, but the values must be reversed at the peer. + +======================== =================== +Host 1 Host2 +======================== =================== +udp_sport=5000 udp_sport=5001 +udp_dport=5001 udp_dport=5000 +tunnel_id=42 tunnel_id=45 +peer_tunnel_id=45 peer_tunnel_id=42 +session_id=128 session_id=5196755 +peer_session_id=5196755 peer_session_id=128 +======================== =================== + +When done at both ends of the tunnel, it should be possible to send +data over the network. e.g.:: + + # ping 10.5.1.1 + + +Sample Userspace Code +===================== + +1. Create tunnel management PPPoX socket:: + + kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP); + if (kernel_fd >= 0) { + struct sockaddr_pppol2tp sax; + struct sockaddr_in const *peer_addr; + + peer_addr = l2tp_tunnel_get_peer_addr(tunnel); + memset(&sax, 0, sizeof(sax)); + sax.sa_family = AF_PPPOX; + sax.sa_protocol = PX_PROTO_OL2TP; + sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */ + sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr; + sax.pppol2tp.addr.sin_port = peer_addr->sin_port; + sax.pppol2tp.addr.sin_family = AF_INET; + sax.pppol2tp.s_tunnel = tunnel_id; + sax.pppol2tp.s_session = 0; /* special case: mgmt socket */ + sax.pppol2tp.d_tunnel = 0; + sax.pppol2tp.d_session = 0; /* special case: mgmt socket */ + + if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) { + perror("connect failed"); + result = -errno; + goto err; + } + } + +2. Create session PPPoX data socket:: + + struct sockaddr_pppol2tp sax; + int fd; + + /* Note, the target socket must be bound already, else it will not be ready */ + sax.sa_family = AF_PPPOX; + sax.sa_protocol = PX_PROTO_OL2TP; + sax.pppol2tp.fd = tunnel_fd; + sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr; + sax.pppol2tp.addr.sin_port = addr->sin_port; + sax.pppol2tp.addr.sin_family = AF_INET; + sax.pppol2tp.s_tunnel = tunnel_id; + sax.pppol2tp.s_session = session_id; + sax.pppol2tp.d_tunnel = peer_tunnel_id; + sax.pppol2tp.d_session = peer_session_id; + + /* session_fd is the fd of the session's PPPoL2TP socket. + * tunnel_fd is the fd of the tunnel UDP socket. + */ + fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax)); + if (fd < 0 ) { + return -errno; + } + return 0; + +Internal Implementation +======================= + +The driver keeps a struct l2tp_tunnel context per L2TP tunnel and a +struct l2tp_session context for each session. The l2tp_tunnel is +always associated with a UDP or L2TP/IP socket and keeps a list of +sessions in the tunnel. The l2tp_session context keeps kernel state +about the session. It has private data which is used for data specific +to the session type. With L2TPv2, the session always carried PPP +traffic. With L2TPv3, the session can also carry ethernet frames +(ethernet pseudowire) or other data types such as ATM, HDLC or Frame +Relay. + +When a tunnel is first opened, the reference count on the socket is +increased using sock_hold(). This ensures that the kernel socket +cannot be removed while L2TP's data structures reference it. + +Some L2TP sessions also have a socket (PPP pseudowires) while others +do not (ethernet pseudowires). We can't use the socket reference count +as the reference count for session contexts. The L2TP implementation +therefore has its own internal reference counts on the session +contexts. + +To Do +===== + +Add L2TP tunnel switching support. This would route tunneled traffic +from one L2TP tunnel into another. Specified in +http://tools.ietf.org/html/draft-ietf-l2tpext-tunnel-switching-08 + +Add L2TPv3 VLAN pseudowire support. + +Add L2TPv3 IP pseudowire support. + +Add L2TPv3 ATM pseudowire support. + +Miscellaneous +============= + +The L2TP drivers were developed as part of the OpenL2TP project by +Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server, +designed from the ground up to have the L2TP datapath in the +kernel. The project also implemented the pppol2tp plugin for pppd +which allows pppd to use the kernel driver. Details can be found at +http://www.openl2tp.org. diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.txt deleted file mode 100644 index 9bc271cdc9a8..000000000000 --- a/Documentation/networking/l2tp.txt +++ /dev/null @@ -1,345 +0,0 @@ -This document describes how to use the kernel's L2TP drivers to -provide L2TP functionality. L2TP is a protocol that tunnels one or -more sessions over an IP tunnel. It is commonly used for VPNs -(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP -network infrastructure. With L2TPv3, it is also useful as a Layer-2 -tunneling infrastructure. - -Features -======== - -L2TPv2 (PPP over L2TP (UDP tunnels)). -L2TPv3 ethernet pseudowires. -L2TPv3 PPP pseudowires. -L2TPv3 IP encapsulation. -Netlink sockets for L2TPv3 configuration management. - -History -======= - -The original pppol2tp driver was introduced in 2.6.23 and provided -L2TPv2 functionality (rfc2661). L2TPv2 is used to tunnel one or more PPP -sessions over a UDP tunnel. - -L2TPv3 (rfc3931) changes the protocol to allow different frame types -to be passed over an L2TP tunnel by moving the PPP-specific parts of -the protocol out of the core L2TP packet headers. Each frame type is -known as a pseudowire type. Ethernet, PPP, HDLC, Frame Relay and ATM -pseudowires for L2TP are defined in separate RFC standards. Another -change for L2TPv3 is that it can be carried directly over IP with no -UDP header (UDP is optional). It is also possible to create static -unmanaged L2TPv3 tunnels manually without a control protocol -(userspace daemon) to manage them. - -To support L2TPv3, the original pppol2tp driver was split up to -separate the L2TP and PPP functionality. Existing L2TPv2 userspace -apps should be unaffected as the original pppol2tp sockets API is -retained. L2TPv3, however, uses netlink to manage L2TPv3 tunnels and -sessions. - -Design -====== - -The L2TP protocol separates control and data frames. The L2TP kernel -drivers handle only L2TP data frames; control frames are always -handled by userspace. L2TP control frames carry messages between L2TP -clients/servers and are used to setup / teardown tunnels and -sessions. An L2TP client or server is implemented in userspace. - -Each L2TP tunnel is implemented using a UDP or L2TPIP socket; L2TPIP -provides L2TPv3 IP encapsulation (no UDP) and is implemented using a -new l2tpip socket family. The tunnel socket is typically created by -userspace, though for unmanaged L2TPv3 tunnels, the socket can also be -created by the kernel. Each L2TP session (pseudowire) gets a network -interface instance. In the case of PPP, these interfaces are created -indirectly by pppd using a pppol2tp socket. In the case of ethernet, -the netdevice is created upon a netlink request to create an L2TPv3 -ethernet pseudowire. - -For PPP, the PPPoL2TP driver, net/l2tp/l2tp_ppp.c, provides a -mechanism by which PPP frames carried through an L2TP session are -passed through the kernel's PPP subsystem. The standard PPP daemon, -pppd, handles all PPP interaction with the peer. PPP network -interfaces are created for each local PPP endpoint. The kernel's PPP -subsystem arranges for PPP control frames to be delivered to pppd, -while data frames are forwarded as usual. - -For ethernet, the L2TPETH driver, net/l2tp/l2tp_eth.c, implements a -netdevice driver, managing virtual ethernet devices, one per -pseudowire. These interfaces can be managed using standard Linux tools -such as "ip" and "ifconfig". If only IP frames are passed over the -tunnel, the interface can be given an IP addresses of itself and its -peer. If non-IP frames are to be passed over the tunnel, the interface -can be added to a bridge using brctl. All L2TP datapath protocol -functions are handled by the L2TP core driver. - -Each tunnel and session within a tunnel is assigned a unique tunnel_id -and session_id. These ids are carried in the L2TP header of every -control and data packet. (Actually, in L2TPv3, the tunnel_id isn't -present in data frames - it is inferred from the IP connection on -which the packet was received.) The L2TP driver uses the ids to lookup -internal tunnel and/or session contexts to determine how to handle the -packet. Zero tunnel / session ids are treated specially - zero ids are -never assigned to tunnels or sessions in the network. In the driver, -the tunnel context keeps a reference to the tunnel UDP or L2TPIP -socket. The session context holds data that lets the driver interface -to the kernel's network frame type subsystems, i.e. PPP, ethernet. - -Userspace Programming -===================== - -For L2TPv2, there are a number of requirements on the userspace L2TP -daemon in order to use the pppol2tp driver. - -1. Use a UDP socket per tunnel. - -2. Create a single PPPoL2TP socket per tunnel bound to a special null - session id. This is used only for communicating with the driver but - must remain open while the tunnel is active. Opening this tunnel - management socket causes the driver to mark the tunnel socket as an - L2TP UDP encapsulation socket and flags it for use by the - referenced tunnel id. This hooks up the UDP receive path via - udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed - in this special PPPoX socket. - -3. Create a PPPoL2TP socket per L2TP session. This is typically done - by starting pppd with the pppol2tp plugin and appropriate - arguments. A PPPoL2TP tunnel management socket (Step 2) must be - created before the first PPPoL2TP session socket is created. - -When creating PPPoL2TP sockets, the application provides information -to the driver about the socket in a socket connect() call. Source and -destination tunnel and session ids are provided, as well as the file -descriptor of a UDP socket. See struct pppol2tp_addr in -include/linux/if_pppol2tp.h. Note that zero tunnel / session ids are -treated specially. When creating the per-tunnel PPPoL2TP management -socket in Step 2 above, zero source and destination session ids are -specified, which tells the driver to prepare the supplied UDP file -descriptor for use as an L2TP tunnel socket. - -Userspace may control behavior of the tunnel or session using -setsockopt and ioctl on the PPPoX socket. The following socket -options are supported:- - -DEBUG - bitmask of debug message categories. See below. -SENDSEQ - 0 => don't send packets with sequence numbers - 1 => send packets with sequence numbers -RECVSEQ - 0 => receive packet sequence numbers are optional - 1 => drop receive packets without sequence numbers -LNSMODE - 0 => act as LAC. - 1 => act as LNS. -REORDERTO - reorder timeout (in millisecs). If 0, don't try to reorder. - -Only the DEBUG option is supported by the special tunnel management -PPPoX socket. - -In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided -to retrieve tunnel and session statistics from the kernel using the -PPPoX socket of the appropriate tunnel or session. - -For L2TPv3, userspace must use the netlink API defined in -include/linux/l2tp.h to manage tunnel and session contexts. The -general procedure to create a new L2TP tunnel with one session is:- - -1. Open a GENL socket using L2TP_GENL_NAME for configuring the kernel - using netlink. - -2. Create a UDP or L2TPIP socket for the tunnel. - -3. Create a new L2TP tunnel using a L2TP_CMD_TUNNEL_CREATE - request. Set attributes according to desired tunnel parameters, - referencing the UDP or L2TPIP socket created in the previous step. - -4. Create a new L2TP session in the tunnel using a - L2TP_CMD_SESSION_CREATE request. - -The tunnel and all of its sessions are closed when the tunnel socket -is closed. The netlink API may also be used to delete sessions and -tunnels. Configuration and status info may be set or read using netlink. - -The L2TP driver also supports static (unmanaged) L2TPv3 tunnels. These -are where there is no L2TP control message exchange with the peer to -setup the tunnel; the tunnel is configured manually at each end of the -tunnel. There is no need for an L2TP userspace application in this -case -- the tunnel socket is created by the kernel and configured -using parameters sent in the L2TP_CMD_TUNNEL_CREATE netlink -request. The "ip" utility of iproute2 has commands for managing static -L2TPv3 tunnels; do "ip l2tp help" for more information. - -Debugging -========= - -The driver supports a flexible debug scheme where kernel trace -messages may be optionally enabled per tunnel and per session. Care is -needed when debugging a live system since the messages are not -rate-limited and a busy system could be swamped. Userspace uses -setsockopt on the PPPoX socket to set a debug mask. - -The following debug mask bits are available: - -L2TP_MSG_DEBUG verbose debug (if compiled in) -L2TP_MSG_CONTROL userspace - kernel interface -L2TP_MSG_SEQ sequence numbers handling -L2TP_MSG_DATA data packets - -If enabled, files under a l2tp debugfs directory can be used to dump -kernel state about L2TP tunnels and sessions. To access it, the -debugfs filesystem must first be mounted. - -# mount -t debugfs debugfs /debug - -Files under the l2tp directory can then be accessed. - -# cat /debug/l2tp/tunnels - -The debugfs files should not be used by applications to obtain L2TP -state information because the file format is subject to change. It is -implemented to provide extra debug information to help diagnose -problems.) Users should use the netlink API. - -/proc/net/pppol2tp is also provided for backwards compatibility with -the original pppol2tp driver. It lists information about L2TPv2 -tunnels and sessions only. Its use is discouraged. - -Unmanaged L2TPv3 Tunnels -======================== - -Some commercial L2TP products support unmanaged L2TPv3 ethernet -tunnels, where there is no L2TP control protocol; tunnels are -configured at each side manually. New commands are available in -iproute2's ip utility to support this. - -To create an L2TPv3 ethernet pseudowire between local host 192.168.1.1 -and peer 192.168.1.2, using IP addresses 10.5.1.1 and 10.5.1.2 for the -tunnel endpoints:- - -# ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \ - udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2 -# ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1 -# ip -s -d show dev l2tpeth0 -# ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0 -# ip li set dev l2tpeth0 up - -Choose IP addresses to be the address of a local IP interface and that -of the remote system. The IP addresses of the l2tpeth0 interface can be -anything suitable. - -Repeat the above at the peer, with ports, tunnel/session ids and IP -addresses reversed. The tunnel and session IDs can be any non-zero -32-bit number, but the values must be reversed at the peer. - -Host 1 Host2 -udp_sport=5000 udp_sport=5001 -udp_dport=5001 udp_dport=5000 -tunnel_id=42 tunnel_id=45 -peer_tunnel_id=45 peer_tunnel_id=42 -session_id=128 session_id=5196755 -peer_session_id=5196755 peer_session_id=128 - -When done at both ends of the tunnel, it should be possible to send -data over the network. e.g. - -# ping 10.5.1.1 - - -Sample Userspace Code -===================== - -1. Create tunnel management PPPoX socket - - kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP); - if (kernel_fd >= 0) { - struct sockaddr_pppol2tp sax; - struct sockaddr_in const *peer_addr; - - peer_addr = l2tp_tunnel_get_peer_addr(tunnel); - memset(&sax, 0, sizeof(sax)); - sax.sa_family = AF_PPPOX; - sax.sa_protocol = PX_PROTO_OL2TP; - sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */ - sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr; - sax.pppol2tp.addr.sin_port = peer_addr->sin_port; - sax.pppol2tp.addr.sin_family = AF_INET; - sax.pppol2tp.s_tunnel = tunnel_id; - sax.pppol2tp.s_session = 0; /* special case: mgmt socket */ - sax.pppol2tp.d_tunnel = 0; - sax.pppol2tp.d_session = 0; /* special case: mgmt socket */ - - if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) { - perror("connect failed"); - result = -errno; - goto err; - } - } - -2. Create session PPPoX data socket - - struct sockaddr_pppol2tp sax; - int fd; - - /* Note, the target socket must be bound already, else it will not be ready */ - sax.sa_family = AF_PPPOX; - sax.sa_protocol = PX_PROTO_OL2TP; - sax.pppol2tp.fd = tunnel_fd; - sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr; - sax.pppol2tp.addr.sin_port = addr->sin_port; - sax.pppol2tp.addr.sin_family = AF_INET; - sax.pppol2tp.s_tunnel = tunnel_id; - sax.pppol2tp.s_session = session_id; - sax.pppol2tp.d_tunnel = peer_tunnel_id; - sax.pppol2tp.d_session = peer_session_id; - - /* session_fd is the fd of the session's PPPoL2TP socket. - * tunnel_fd is the fd of the tunnel UDP socket. - */ - fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax)); - if (fd < 0 ) { - return -errno; - } - return 0; - -Internal Implementation -======================= - -The driver keeps a struct l2tp_tunnel context per L2TP tunnel and a -struct l2tp_session context for each session. The l2tp_tunnel is -always associated with a UDP or L2TP/IP socket and keeps a list of -sessions in the tunnel. The l2tp_session context keeps kernel state -about the session. It has private data which is used for data specific -to the session type. With L2TPv2, the session always carried PPP -traffic. With L2TPv3, the session can also carry ethernet frames -(ethernet pseudowire) or other data types such as ATM, HDLC or Frame -Relay. - -When a tunnel is first opened, the reference count on the socket is -increased using sock_hold(). This ensures that the kernel socket -cannot be removed while L2TP's data structures reference it. - -Some L2TP sessions also have a socket (PPP pseudowires) while others -do not (ethernet pseudowires). We can't use the socket reference count -as the reference count for session contexts. The L2TP implementation -therefore has its own internal reference counts on the session -contexts. - -To Do -===== - -Add L2TP tunnel switching support. This would route tunneled traffic -from one L2TP tunnel into another. Specified in -http://tools.ietf.org/html/draft-ietf-l2tpext-tunnel-switching-08 - -Add L2TPv3 VLAN pseudowire support. - -Add L2TPv3 IP pseudowire support. - -Add L2TPv3 ATM pseudowire support. - -Miscellaneous -============= - -The L2TP drivers were developed as part of the OpenL2TP project by -Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server, -designed from the ground up to have the L2TP datapath in the -kernel. The project also implemented the pppol2tp plugin for pppd -which allows pppd to use the kernel driver. Details can be found at -http://www.openl2tp.org. -- cgit v1.2.3 From 40e79150c1686263e6a031d7702aec63aff31332 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:03:57 +0200 Subject: docs: networking: convert lapb-module.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/lapb-module.rst | 305 +++++++++++++++++++++++++++++++ Documentation/networking/lapb-module.txt | 263 -------------------------- MAINTAINERS | 2 +- net/lapb/Kconfig | 2 +- 5 files changed, 308 insertions(+), 265 deletions(-) create mode 100644 Documentation/networking/lapb-module.rst delete mode 100644 Documentation/networking/lapb-module.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 0c5d7a037983..acd2567cf0d4 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -75,6 +75,7 @@ Contents: ipvs-sysctl kcm l2tp + lapb-module .. only:: subproject and html diff --git a/Documentation/networking/lapb-module.rst b/Documentation/networking/lapb-module.rst new file mode 100644 index 000000000000..ff586bc9f005 --- /dev/null +++ b/Documentation/networking/lapb-module.rst @@ -0,0 +1,305 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +The Linux LAPB Module Interface +=============================== + +Version 1.3 + +Jonathan Naylor 29.12.96 + +Changed (Henner Eisen, 2000-10-29): int return value for data_indication() + +The LAPB module will be a separately compiled module for use by any parts of +the Linux operating system that require a LAPB service. This document +defines the interfaces to, and the services provided by this module. The +term module in this context does not imply that the LAPB module is a +separately loadable module, although it may be. The term module is used in +its more standard meaning. + +The interface to the LAPB module consists of functions to the module, +callbacks from the module to indicate important state changes, and +structures for getting and setting information about the module. + +Structures +---------- + +Probably the most important structure is the skbuff structure for holding +received and transmitted data, however it is beyond the scope of this +document. + +The two LAPB specific structures are the LAPB initialisation structure and +the LAPB parameter structure. These will be defined in a standard header +file, . The header file is internal to the LAPB +module and is not for use. + +LAPB Initialisation Structure +----------------------------- + +This structure is used only once, in the call to lapb_register (see below). +It contains information about the device driver that requires the services +of the LAPB module:: + + struct lapb_register_struct { + void (*connect_confirmation)(int token, int reason); + void (*connect_indication)(int token, int reason); + void (*disconnect_confirmation)(int token, int reason); + void (*disconnect_indication)(int token, int reason); + int (*data_indication)(int token, struct sk_buff *skb); + void (*data_transmit)(int token, struct sk_buff *skb); + }; + +Each member of this structure corresponds to a function in the device driver +that is called when a particular event in the LAPB module occurs. These will +be described in detail below. If a callback is not required (!!) then a NULL +may be substituted. + + +LAPB Parameter Structure +------------------------ + +This structure is used with the lapb_getparms and lapb_setparms functions +(see below). They are used to allow the device driver to get and set the +operational parameters of the LAPB implementation for a given connection:: + + struct lapb_parms_struct { + unsigned int t1; + unsigned int t1timer; + unsigned int t2; + unsigned int t2timer; + unsigned int n2; + unsigned int n2count; + unsigned int window; + unsigned int state; + unsigned int mode; + }; + +T1 and T2 are protocol timing parameters and are given in units of 100ms. N2 +is the maximum number of tries on the link before it is declared a failure. +The window size is the maximum number of outstanding data packets allowed to +be unacknowledged by the remote end, the value of the window is between 1 +and 7 for a standard LAPB link, and between 1 and 127 for an extended LAPB +link. + +The mode variable is a bit field used for setting (at present) three values. +The bit fields have the following meanings: + +====== ================================================= +Bit Meaning +====== ================================================= +0 LAPB operation (0=LAPB_STANDARD 1=LAPB_EXTENDED). +1 [SM]LP operation (0=LAPB_SLP 1=LAPB=MLP). +2 DTE/DCE operation (0=LAPB_DTE 1=LAPB_DCE) +3-31 Reserved, must be 0. +====== ================================================= + +Extended LAPB operation indicates the use of extended sequence numbers and +consequently larger window sizes, the default is standard LAPB operation. +MLP operation is the same as SLP operation except that the addresses used by +LAPB are different to indicate the mode of operation, the default is Single +Link Procedure. The difference between DCE and DTE operation is (i) the +addresses used for commands and responses, and (ii) when the DCE is not +connected, it sends DM without polls set, every T1. The upper case constant +names will be defined in the public LAPB header file. + + +Functions +--------- + +The LAPB module provides a number of function entry points. + +:: + + int lapb_register(void *token, struct lapb_register_struct); + +This must be called before the LAPB module may be used. If the call is +successful then LAPB_OK is returned. The token must be a unique identifier +generated by the device driver to allow for the unique identification of the +instance of the LAPB link. It is returned by the LAPB module in all of the +callbacks, and is used by the device driver in all calls to the LAPB module. +For multiple LAPB links in a single device driver, multiple calls to +lapb_register must be made. The format of the lapb_register_struct is given +above. The return values are: + +============= ============================= +LAPB_OK LAPB registered successfully. +LAPB_BADTOKEN Token is already registered. +LAPB_NOMEM Out of memory +============= ============================= + +:: + + int lapb_unregister(void *token); + +This releases all the resources associated with a LAPB link. Any current +LAPB link will be abandoned without further messages being passed. After +this call, the value of token is no longer valid for any calls to the LAPB +function. The valid return values are: + +============= =============================== +LAPB_OK LAPB unregistered successfully. +LAPB_BADTOKEN Invalid/unknown LAPB token. +============= =============================== + +:: + + int lapb_getparms(void *token, struct lapb_parms_struct *parms); + +This allows the device driver to get the values of the current LAPB +variables, the lapb_parms_struct is described above. The valid return values +are: + +============= ============================= +LAPB_OK LAPB getparms was successful. +LAPB_BADTOKEN Invalid/unknown LAPB token. +============= ============================= + +:: + + int lapb_setparms(void *token, struct lapb_parms_struct *parms); + +This allows the device driver to set the values of the current LAPB +variables, the lapb_parms_struct is described above. The values of t1timer, +t2timer and n2count are ignored, likewise changing the mode bits when +connected will be ignored. An error implies that none of the values have +been changed. The valid return values are: + +============= ================================================= +LAPB_OK LAPB getparms was successful. +LAPB_BADTOKEN Invalid/unknown LAPB token. +LAPB_INVALUE One of the values was out of its allowable range. +============= ================================================= + +:: + + int lapb_connect_request(void *token); + +Initiate a connect using the current parameter settings. The valid return +values are: + +============== ================================= +LAPB_OK LAPB is starting to connect. +LAPB_BADTOKEN Invalid/unknown LAPB token. +LAPB_CONNECTED LAPB module is already connected. +============== ================================= + +:: + + int lapb_disconnect_request(void *token); + +Initiate a disconnect. The valid return values are: + +================= =============================== +LAPB_OK LAPB is starting to disconnect. +LAPB_BADTOKEN Invalid/unknown LAPB token. +LAPB_NOTCONNECTED LAPB module is not connected. +================= =============================== + +:: + + int lapb_data_request(void *token, struct sk_buff *skb); + +Queue data with the LAPB module for transmitting over the link. If the call +is successful then the skbuff is owned by the LAPB module and may not be +used by the device driver again. The valid return values are: + +================= ============================= +LAPB_OK LAPB has accepted the data. +LAPB_BADTOKEN Invalid/unknown LAPB token. +LAPB_NOTCONNECTED LAPB module is not connected. +================= ============================= + +:: + + int lapb_data_received(void *token, struct sk_buff *skb); + +Queue data with the LAPB module which has been received from the device. It +is expected that the data passed to the LAPB module has skb->data pointing +to the beginning of the LAPB data. If the call is successful then the skbuff +is owned by the LAPB module and may not be used by the device driver again. +The valid return values are: + +============= =========================== +LAPB_OK LAPB has accepted the data. +LAPB_BADTOKEN Invalid/unknown LAPB token. +============= =========================== + +Callbacks +--------- + +These callbacks are functions provided by the device driver for the LAPB +module to call when an event occurs. They are registered with the LAPB +module with lapb_register (see above) in the structure lapb_register_struct +(see above). + +:: + + void (*connect_confirmation)(void *token, int reason); + +This is called by the LAPB module when a connection is established after +being requested by a call to lapb_connect_request (see above). The reason is +always LAPB_OK. + +:: + + void (*connect_indication)(void *token, int reason); + +This is called by the LAPB module when the link is established by the remote +system. The value of reason is always LAPB_OK. + +:: + + void (*disconnect_confirmation)(void *token, int reason); + +This is called by the LAPB module when an event occurs after the device +driver has called lapb_disconnect_request (see above). The reason indicates +what has happened. In all cases the LAPB link can be regarded as being +terminated. The values for reason are: + +================= ==================================================== +LAPB_OK The LAPB link was terminated normally. +LAPB_NOTCONNECTED The remote system was not connected. +LAPB_TIMEDOUT No response was received in N2 tries from the remote + system. +================= ==================================================== + +:: + + void (*disconnect_indication)(void *token, int reason); + +This is called by the LAPB module when the link is terminated by the remote +system or another event has occurred to terminate the link. This may be +returned in response to a lapb_connect_request (see above) if the remote +system refused the request. The values for reason are: + +================= ==================================================== +LAPB_OK The LAPB link was terminated normally by the remote + system. +LAPB_REFUSED The remote system refused the connect request. +LAPB_NOTCONNECTED The remote system was not connected. +LAPB_TIMEDOUT No response was received in N2 tries from the remote + system. +================= ==================================================== + +:: + + int (*data_indication)(void *token, struct sk_buff *skb); + +This is called by the LAPB module when data has been received from the +remote system that should be passed onto the next layer in the protocol +stack. The skbuff becomes the property of the device driver and the LAPB +module will not perform any more actions on it. The skb->data pointer will +be pointing to the first byte of data after the LAPB header. + +This method should return NET_RX_DROP (as defined in the header +file include/linux/netdevice.h) if and only if the frame was dropped +before it could be delivered to the upper layer. + +:: + + void (*data_transmit)(void *token, struct sk_buff *skb); + +This is called by the LAPB module when data is to be transmitted to the +remote system by the device driver. The skbuff becomes the property of the +device driver and the LAPB module will not perform any more actions on it. +The skb->data pointer will be pointing to the first byte of the LAPB header. diff --git a/Documentation/networking/lapb-module.txt b/Documentation/networking/lapb-module.txt deleted file mode 100644 index d4fc8f221559..000000000000 --- a/Documentation/networking/lapb-module.txt +++ /dev/null @@ -1,263 +0,0 @@ - The Linux LAPB Module Interface 1.3 - - Jonathan Naylor 29.12.96 - -Changed (Henner Eisen, 2000-10-29): int return value for data_indication() - -The LAPB module will be a separately compiled module for use by any parts of -the Linux operating system that require a LAPB service. This document -defines the interfaces to, and the services provided by this module. The -term module in this context does not imply that the LAPB module is a -separately loadable module, although it may be. The term module is used in -its more standard meaning. - -The interface to the LAPB module consists of functions to the module, -callbacks from the module to indicate important state changes, and -structures for getting and setting information about the module. - -Structures ----------- - -Probably the most important structure is the skbuff structure for holding -received and transmitted data, however it is beyond the scope of this -document. - -The two LAPB specific structures are the LAPB initialisation structure and -the LAPB parameter structure. These will be defined in a standard header -file, . The header file is internal to the LAPB -module and is not for use. - -LAPB Initialisation Structure ------------------------------ - -This structure is used only once, in the call to lapb_register (see below). -It contains information about the device driver that requires the services -of the LAPB module. - -struct lapb_register_struct { - void (*connect_confirmation)(int token, int reason); - void (*connect_indication)(int token, int reason); - void (*disconnect_confirmation)(int token, int reason); - void (*disconnect_indication)(int token, int reason); - int (*data_indication)(int token, struct sk_buff *skb); - void (*data_transmit)(int token, struct sk_buff *skb); -}; - -Each member of this structure corresponds to a function in the device driver -that is called when a particular event in the LAPB module occurs. These will -be described in detail below. If a callback is not required (!!) then a NULL -may be substituted. - - -LAPB Parameter Structure ------------------------- - -This structure is used with the lapb_getparms and lapb_setparms functions -(see below). They are used to allow the device driver to get and set the -operational parameters of the LAPB implementation for a given connection. - -struct lapb_parms_struct { - unsigned int t1; - unsigned int t1timer; - unsigned int t2; - unsigned int t2timer; - unsigned int n2; - unsigned int n2count; - unsigned int window; - unsigned int state; - unsigned int mode; -}; - -T1 and T2 are protocol timing parameters and are given in units of 100ms. N2 -is the maximum number of tries on the link before it is declared a failure. -The window size is the maximum number of outstanding data packets allowed to -be unacknowledged by the remote end, the value of the window is between 1 -and 7 for a standard LAPB link, and between 1 and 127 for an extended LAPB -link. - -The mode variable is a bit field used for setting (at present) three values. -The bit fields have the following meanings: - -Bit Meaning -0 LAPB operation (0=LAPB_STANDARD 1=LAPB_EXTENDED). -1 [SM]LP operation (0=LAPB_SLP 1=LAPB=MLP). -2 DTE/DCE operation (0=LAPB_DTE 1=LAPB_DCE) -3-31 Reserved, must be 0. - -Extended LAPB operation indicates the use of extended sequence numbers and -consequently larger window sizes, the default is standard LAPB operation. -MLP operation is the same as SLP operation except that the addresses used by -LAPB are different to indicate the mode of operation, the default is Single -Link Procedure. The difference between DCE and DTE operation is (i) the -addresses used for commands and responses, and (ii) when the DCE is not -connected, it sends DM without polls set, every T1. The upper case constant -names will be defined in the public LAPB header file. - - -Functions ---------- - -The LAPB module provides a number of function entry points. - - -int lapb_register(void *token, struct lapb_register_struct); - -This must be called before the LAPB module may be used. If the call is -successful then LAPB_OK is returned. The token must be a unique identifier -generated by the device driver to allow for the unique identification of the -instance of the LAPB link. It is returned by the LAPB module in all of the -callbacks, and is used by the device driver in all calls to the LAPB module. -For multiple LAPB links in a single device driver, multiple calls to -lapb_register must be made. The format of the lapb_register_struct is given -above. The return values are: - -LAPB_OK LAPB registered successfully. -LAPB_BADTOKEN Token is already registered. -LAPB_NOMEM Out of memory - - -int lapb_unregister(void *token); - -This releases all the resources associated with a LAPB link. Any current -LAPB link will be abandoned without further messages being passed. After -this call, the value of token is no longer valid for any calls to the LAPB -function. The valid return values are: - -LAPB_OK LAPB unregistered successfully. -LAPB_BADTOKEN Invalid/unknown LAPB token. - - -int lapb_getparms(void *token, struct lapb_parms_struct *parms); - -This allows the device driver to get the values of the current LAPB -variables, the lapb_parms_struct is described above. The valid return values -are: - -LAPB_OK LAPB getparms was successful. -LAPB_BADTOKEN Invalid/unknown LAPB token. - - -int lapb_setparms(void *token, struct lapb_parms_struct *parms); - -This allows the device driver to set the values of the current LAPB -variables, the lapb_parms_struct is described above. The values of t1timer, -t2timer and n2count are ignored, likewise changing the mode bits when -connected will be ignored. An error implies that none of the values have -been changed. The valid return values are: - -LAPB_OK LAPB getparms was successful. -LAPB_BADTOKEN Invalid/unknown LAPB token. -LAPB_INVALUE One of the values was out of its allowable range. - - -int lapb_connect_request(void *token); - -Initiate a connect using the current parameter settings. The valid return -values are: - -LAPB_OK LAPB is starting to connect. -LAPB_BADTOKEN Invalid/unknown LAPB token. -LAPB_CONNECTED LAPB module is already connected. - - -int lapb_disconnect_request(void *token); - -Initiate a disconnect. The valid return values are: - -LAPB_OK LAPB is starting to disconnect. -LAPB_BADTOKEN Invalid/unknown LAPB token. -LAPB_NOTCONNECTED LAPB module is not connected. - - -int lapb_data_request(void *token, struct sk_buff *skb); - -Queue data with the LAPB module for transmitting over the link. If the call -is successful then the skbuff is owned by the LAPB module and may not be -used by the device driver again. The valid return values are: - -LAPB_OK LAPB has accepted the data. -LAPB_BADTOKEN Invalid/unknown LAPB token. -LAPB_NOTCONNECTED LAPB module is not connected. - - -int lapb_data_received(void *token, struct sk_buff *skb); - -Queue data with the LAPB module which has been received from the device. It -is expected that the data passed to the LAPB module has skb->data pointing -to the beginning of the LAPB data. If the call is successful then the skbuff -is owned by the LAPB module and may not be used by the device driver again. -The valid return values are: - -LAPB_OK LAPB has accepted the data. -LAPB_BADTOKEN Invalid/unknown LAPB token. - - -Callbacks ---------- - -These callbacks are functions provided by the device driver for the LAPB -module to call when an event occurs. They are registered with the LAPB -module with lapb_register (see above) in the structure lapb_register_struct -(see above). - - -void (*connect_confirmation)(void *token, int reason); - -This is called by the LAPB module when a connection is established after -being requested by a call to lapb_connect_request (see above). The reason is -always LAPB_OK. - - -void (*connect_indication)(void *token, int reason); - -This is called by the LAPB module when the link is established by the remote -system. The value of reason is always LAPB_OK. - - -void (*disconnect_confirmation)(void *token, int reason); - -This is called by the LAPB module when an event occurs after the device -driver has called lapb_disconnect_request (see above). The reason indicates -what has happened. In all cases the LAPB link can be regarded as being -terminated. The values for reason are: - -LAPB_OK The LAPB link was terminated normally. -LAPB_NOTCONNECTED The remote system was not connected. -LAPB_TIMEDOUT No response was received in N2 tries from the remote - system. - - -void (*disconnect_indication)(void *token, int reason); - -This is called by the LAPB module when the link is terminated by the remote -system or another event has occurred to terminate the link. This may be -returned in response to a lapb_connect_request (see above) if the remote -system refused the request. The values for reason are: - -LAPB_OK The LAPB link was terminated normally by the remote - system. -LAPB_REFUSED The remote system refused the connect request. -LAPB_NOTCONNECTED The remote system was not connected. -LAPB_TIMEDOUT No response was received in N2 tries from the remote - system. - - -int (*data_indication)(void *token, struct sk_buff *skb); - -This is called by the LAPB module when data has been received from the -remote system that should be passed onto the next layer in the protocol -stack. The skbuff becomes the property of the device driver and the LAPB -module will not perform any more actions on it. The skb->data pointer will -be pointing to the first byte of data after the LAPB header. - -This method should return NET_RX_DROP (as defined in the header -file include/linux/netdevice.h) if and only if the frame was dropped -before it could be delivered to the upper layer. - - -void (*data_transmit)(void *token, struct sk_buff *skb); - -This is called by the LAPB module when data is to be transmitted to the -remote system by the device driver. The skbuff becomes the property of the -device driver and the LAPB module will not perform any more actions on it. -The skb->data pointer will be pointing to the first byte of the LAPB header. diff --git a/MAINTAINERS b/MAINTAINERS index 3a5f52a3c055..956999d2d979 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9515,7 +9515,7 @@ F: drivers/soc/lantiq LAPB module L: linux-x25@vger.kernel.org S: Orphan -F: Documentation/networking/lapb-module.txt +F: Documentation/networking/lapb-module.rst F: include/*/lapb.h F: net/lapb/ diff --git a/net/lapb/Kconfig b/net/lapb/Kconfig index 6acfc999c952..5b50e8d64f26 100644 --- a/net/lapb/Kconfig +++ b/net/lapb/Kconfig @@ -15,7 +15,7 @@ config LAPB currently supports LAPB only over Ethernet connections. If you want to use LAPB connections over Ethernet, say Y here and to "LAPB over Ethernet driver" below. Read - for technical + for technical details. To compile this driver as a module, choose M here: the -- cgit v1.2.3 From a6b93e6555a6ecd0d08b0383ea4d93d09a168187 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:03:58 +0200 Subject: docs: networking: convert ltpc.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ltpc.rst | 144 +++++++++++++++++++++++++++++++++++++ Documentation/networking/ltpc.txt | 131 --------------------------------- drivers/net/appletalk/Kconfig | 2 +- 4 files changed, 146 insertions(+), 132 deletions(-) create mode 100644 Documentation/networking/ltpc.rst delete mode 100644 Documentation/networking/ltpc.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index acd2567cf0d4..b3608b177a8b 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -76,6 +76,7 @@ Contents: kcm l2tp lapb-module + ltpc .. only:: subproject and html diff --git a/Documentation/networking/ltpc.rst b/Documentation/networking/ltpc.rst new file mode 100644 index 000000000000..0ad197fd17ce --- /dev/null +++ b/Documentation/networking/ltpc.rst @@ -0,0 +1,144 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========== +LTPC Driver +=========== + +This is the ALPHA version of the ltpc driver. + +In order to use it, you will need at least version 1.3.3 of the +netatalk package, and the Apple or Farallon LocalTalk PC card. +There are a number of different LocalTalk cards for the PC; this +driver applies only to the one with the 65c02 processor chip on it. + +To include it in the kernel, select the CONFIG_LTPC switch in the +configuration dialog. You can also compile it as a module. + +While the driver will attempt to autoprobe the I/O port address, IRQ +line, and DMA channel of the card, this does not always work. For +this reason, you should be prepared to supply these parameters +yourself. (see "Card Configuration" below for how to determine or +change the settings on your card) + +When the driver is compiled into the kernel, you can add a line such +as the following to your /etc/lilo.conf:: + + append="ltpc=0x240,9,1" + +where the parameters (in order) are the port address, IRQ, and DMA +channel. The second and third values can be omitted, in which case +the driver will try to determine them itself. + +If you load the driver as a module, you can pass the parameters "io=", +"irq=", and "dma=" on the command line with insmod or modprobe, or add +them as options in a configuration file in /etc/modprobe.d/ directory:: + + alias lt0 ltpc # autoload the module when the interface is configured + options ltpc io=0x240 irq=9 dma=1 + +Before starting up the netatalk demons (perhaps in rc.local), you +need to add a line such as:: + + /sbin/ifconfig lt0 127.0.0.42 + +The address is unimportant - however, the card needs to be configured +with ifconfig so that Netatalk can find it. + +The appropriate netatalk configuration depends on whether you are +attached to a network that includes AppleTalk routers or not. If, +like me, you are simply connecting to your home Macintoshes and +printers, you need to set up netatalk to "seed". The way I do this +is to have the lines:: + + dummy -seed -phase 2 -net 2000 -addr 2000.26 -zone "1033" + lt0 -seed -phase 1 -net 1033 -addr 1033.27 -zone "1033" + +in my atalkd.conf. What is going on here is that I need to fool +netatalk into thinking that there are two AppleTalk interfaces +present; otherwise, it refuses to seed. This is a hack, and a more +permanent solution would be to alter the netatalk code. Also, make +sure you have the correct name for the dummy interface - If it's +compiled as a module, you will need to refer to it as "dummy0" or some +such. + +If you are attached to an extended AppleTalk network, with routers on +it, then you don't need to fool around with this -- the appropriate +line in atalkd.conf is:: + + lt0 -phase 1 + + +Card Configuration +================== + +The interrupts and so forth are configured via the dipswitch on the +board. Set the switches so as not to conflict with other hardware. + + Interrupts -- set at most one. If none are set, the driver uses + polled mode. Because the card was developed in the XT era, the + original documentation refers to IRQ2. Since you'll be running + this on an AT (or later) class machine, that really means IRQ9. + + === =========================================================== + SW1 IRQ 4 + SW2 IRQ 3 + SW3 IRQ 9 (2 in original card documentation only applies to XT) + === =========================================================== + + + DMA -- choose DMA 1 or 3, and set both corresponding switches. + + === ===== + SW4 DMA 3 + SW5 DMA 1 + SW6 DMA 3 + SW7 DMA 1 + === ===== + + + I/O address -- choose one. + + === ========= + SW8 220 / 240 + === ========= + + +IP +== + +Yes, it is possible to do IP over LocalTalk. However, you can't just +treat the LocalTalk device like an ordinary Ethernet device, even if +that's what it looks like to Netatalk. + +Instead, you follow the same procedure as for doing IP in EtherTalk. +See Documentation/networking/ipddp.rst for more information about the +kernel driver and userspace tools needed. + + +Bugs +==== + +IRQ autoprobing often doesn't work on a cold boot. To get around +this, either compile the driver as a module, or pass the parameters +for the card to the kernel as described above. + +Also, as usual, autoprobing is not recommended when you use the driver +as a module. (though it usually works at boot time, at least) + +Polled mode is *really* slow sometimes, but this seems to depend on +the configuration of the network. + +It may theoretically be possible to use two LTPC cards in the same +machine, but this is unsupported, so if you really want to do this, +you'll probably have to hack the initialization code a bit. + + +Thanks +====== + +Thanks to Alan Cox for helpful discussions early on in this +work, and to Denis Hainsworth for doing the bleeding-edge testing. + +Bradford Johnson + +Updated 11/09/1998 by David Huggins-Daines diff --git a/Documentation/networking/ltpc.txt b/Documentation/networking/ltpc.txt deleted file mode 100644 index a005a73b76d0..000000000000 --- a/Documentation/networking/ltpc.txt +++ /dev/null @@ -1,131 +0,0 @@ -This is the ALPHA version of the ltpc driver. - -In order to use it, you will need at least version 1.3.3 of the -netatalk package, and the Apple or Farallon LocalTalk PC card. -There are a number of different LocalTalk cards for the PC; this -driver applies only to the one with the 65c02 processor chip on it. - -To include it in the kernel, select the CONFIG_LTPC switch in the -configuration dialog. You can also compile it as a module. - -While the driver will attempt to autoprobe the I/O port address, IRQ -line, and DMA channel of the card, this does not always work. For -this reason, you should be prepared to supply these parameters -yourself. (see "Card Configuration" below for how to determine or -change the settings on your card) - -When the driver is compiled into the kernel, you can add a line such -as the following to your /etc/lilo.conf: - - append="ltpc=0x240,9,1" - -where the parameters (in order) are the port address, IRQ, and DMA -channel. The second and third values can be omitted, in which case -the driver will try to determine them itself. - -If you load the driver as a module, you can pass the parameters "io=", -"irq=", and "dma=" on the command line with insmod or modprobe, or add -them as options in a configuration file in /etc/modprobe.d/ directory: - - alias lt0 ltpc # autoload the module when the interface is configured - options ltpc io=0x240 irq=9 dma=1 - -Before starting up the netatalk demons (perhaps in rc.local), you -need to add a line such as: - - /sbin/ifconfig lt0 127.0.0.42 - -The address is unimportant - however, the card needs to be configured -with ifconfig so that Netatalk can find it. - -The appropriate netatalk configuration depends on whether you are -attached to a network that includes AppleTalk routers or not. If, -like me, you are simply connecting to your home Macintoshes and -printers, you need to set up netatalk to "seed". The way I do this -is to have the lines - - dummy -seed -phase 2 -net 2000 -addr 2000.26 -zone "1033" - lt0 -seed -phase 1 -net 1033 -addr 1033.27 -zone "1033" - -in my atalkd.conf. What is going on here is that I need to fool -netatalk into thinking that there are two AppleTalk interfaces -present; otherwise, it refuses to seed. This is a hack, and a more -permanent solution would be to alter the netatalk code. Also, make -sure you have the correct name for the dummy interface - If it's -compiled as a module, you will need to refer to it as "dummy0" or some -such. - -If you are attached to an extended AppleTalk network, with routers on -it, then you don't need to fool around with this -- the appropriate -line in atalkd.conf is - - lt0 -phase 1 - --------------------------------------- - -Card Configuration: - -The interrupts and so forth are configured via the dipswitch on the -board. Set the switches so as not to conflict with other hardware. - - Interrupts -- set at most one. If none are set, the driver uses - polled mode. Because the card was developed in the XT era, the - original documentation refers to IRQ2. Since you'll be running - this on an AT (or later) class machine, that really means IRQ9. - - SW1 IRQ 4 - SW2 IRQ 3 - SW3 IRQ 9 (2 in original card documentation only applies to XT) - - - DMA -- choose DMA 1 or 3, and set both corresponding switches. - - SW4 DMA 3 - SW5 DMA 1 - SW6 DMA 3 - SW7 DMA 1 - - - I/O address -- choose one. - - SW8 220 / 240 - --------------------------------------- - -IP: - -Yes, it is possible to do IP over LocalTalk. However, you can't just -treat the LocalTalk device like an ordinary Ethernet device, even if -that's what it looks like to Netatalk. - -Instead, you follow the same procedure as for doing IP in EtherTalk. -See Documentation/networking/ipddp.rst for more information about the -kernel driver and userspace tools needed. - --------------------------------------- - -BUGS: - -IRQ autoprobing often doesn't work on a cold boot. To get around -this, either compile the driver as a module, or pass the parameters -for the card to the kernel as described above. - -Also, as usual, autoprobing is not recommended when you use the driver -as a module. (though it usually works at boot time, at least) - -Polled mode is *really* slow sometimes, but this seems to depend on -the configuration of the network. - -It may theoretically be possible to use two LTPC cards in the same -machine, but this is unsupported, so if you really want to do this, -you'll probably have to hack the initialization code a bit. - -______________________________________ - -THANKS: - Thanks to Alan Cox for helpful discussions early on in this -work, and to Denis Hainsworth for doing the bleeding-edge testing. - --- Bradford Johnson - --- Updated 11/09/1998 by David Huggins-Daines diff --git a/drivers/net/appletalk/Kconfig b/drivers/net/appletalk/Kconfig index ccde6479050c..10589a82263b 100644 --- a/drivers/net/appletalk/Kconfig +++ b/drivers/net/appletalk/Kconfig @@ -48,7 +48,7 @@ config LTPC If you are in doubt, this card is the one with the 65C02 chip on it. You also need version 1.3.3 or later of the netatalk package. This driver is experimental, which means that it may not work. - See the file . + See the file . config COPS tristate "COPS LocalTalk PC support" -- cgit v1.2.3 From 429ff87bcac75b929d9ffec8d4d24be2616f8052 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:03:59 +0200 Subject: docs: networking: convert mac80211-injection.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/mac80211-injection.rst | 106 ++++++++++++++++++++++++ Documentation/networking/mac80211-injection.txt | 97 ---------------------- MAINTAINERS | 2 +- net/mac80211/tx.c | 2 +- 5 files changed, 109 insertions(+), 99 deletions(-) create mode 100644 Documentation/networking/mac80211-injection.rst delete mode 100644 Documentation/networking/mac80211-injection.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b3608b177a8b..81c1834bfb57 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -77,6 +77,7 @@ Contents: l2tp lapb-module ltpc + mac80211-injection .. only:: subproject and html diff --git a/Documentation/networking/mac80211-injection.rst b/Documentation/networking/mac80211-injection.rst new file mode 100644 index 000000000000..75d4edcae852 --- /dev/null +++ b/Documentation/networking/mac80211-injection.rst @@ -0,0 +1,106 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +How to use packet injection with mac80211 +========================================= + +mac80211 now allows arbitrary packets to be injected down any Monitor Mode +interface from userland. The packet you inject needs to be composed in the +following format:: + + [ radiotap header ] + [ ieee80211 header ] + [ payload ] + +The radiotap format is discussed in +./Documentation/networking/radiotap-headers.txt. + +Despite many radiotap parameters being currently defined, most only make sense +to appear on received packets. The following information is parsed from the +radiotap headers and used to control injection: + + * IEEE80211_RADIOTAP_FLAGS + + ========================= =========================================== + IEEE80211_RADIOTAP_F_FCS FCS will be removed and recalculated + IEEE80211_RADIOTAP_F_WEP frame will be encrypted if key available + IEEE80211_RADIOTAP_F_FRAG frame will be fragmented if longer than the + current fragmentation threshold. + ========================= =========================================== + + * IEEE80211_RADIOTAP_TX_FLAGS + + ============================= ======================================== + IEEE80211_RADIOTAP_F_TX_NOACK frame should be sent without waiting for + an ACK even if it is a unicast frame + ============================= ======================================== + + * IEEE80211_RADIOTAP_RATE + + legacy rate for the transmission (only for devices without own rate control) + + * IEEE80211_RADIOTAP_MCS + + HT rate for the transmission (only for devices without own rate control). + Also some flags are parsed + + ============================ ======================== + IEEE80211_RADIOTAP_MCS_SGI use short guard interval + IEEE80211_RADIOTAP_MCS_BW_40 send in HT40 mode + ============================ ======================== + + * IEEE80211_RADIOTAP_DATA_RETRIES + + number of retries when either IEEE80211_RADIOTAP_RATE or + IEEE80211_RADIOTAP_MCS was used + + * IEEE80211_RADIOTAP_VHT + + VHT mcs and number of streams used in the transmission (only for devices + without own rate control). Also other fields are parsed + + flags field + IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval + + bandwidth field + * 1: send using 40MHz channel width + * 4: send using 80MHz channel width + * 11: send using 160MHz channel width + +The injection code can also skip all other currently defined radiotap fields +facilitating replay of captured radiotap headers directly. + +Here is an example valid radiotap header defining some parameters:: + + 0x00, 0x00, // <-- radiotap version + 0x0b, 0x00, // <- radiotap header length + 0x04, 0x0c, 0x00, 0x00, // <-- bitmap + 0x6c, // <-- rate + 0x0c, //<-- tx power + 0x01 //<-- antenna + +The ieee80211 header follows immediately afterwards, looking for example like +this:: + + 0x08, 0x01, 0x00, 0x00, + 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, + 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, + 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, + 0x10, 0x86 + +Then lastly there is the payload. + +After composing the packet contents, it is sent by send()-ing it to a logical +mac80211 interface that is in Monitor mode. Libpcap can also be used, +(which is easier than doing the work to bind the socket to the right +interface), along the following lines::: + + ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf); + ... + r = pcap_inject(ppcap, u8aSendBuffer, nLength); + +You can also find a link to a complete inject application here: + +http://wireless.kernel.org/en/users/Documentation/packetspammer + +Andy Green diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt deleted file mode 100644 index d58d78df9ca2..000000000000 --- a/Documentation/networking/mac80211-injection.txt +++ /dev/null @@ -1,97 +0,0 @@ -How to use packet injection with mac80211 -========================================= - -mac80211 now allows arbitrary packets to be injected down any Monitor Mode -interface from userland. The packet you inject needs to be composed in the -following format: - - [ radiotap header ] - [ ieee80211 header ] - [ payload ] - -The radiotap format is discussed in -./Documentation/networking/radiotap-headers.txt. - -Despite many radiotap parameters being currently defined, most only make sense -to appear on received packets. The following information is parsed from the -radiotap headers and used to control injection: - - * IEEE80211_RADIOTAP_FLAGS - - IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated - IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available - IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the - current fragmentation threshold. - - * IEEE80211_RADIOTAP_TX_FLAGS - - IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for - an ACK even if it is a unicast frame - - * IEEE80211_RADIOTAP_RATE - - legacy rate for the transmission (only for devices without own rate control) - - * IEEE80211_RADIOTAP_MCS - - HT rate for the transmission (only for devices without own rate control). - Also some flags are parsed - - IEEE80211_RADIOTAP_MCS_SGI: use short guard interval - IEEE80211_RADIOTAP_MCS_BW_40: send in HT40 mode - - * IEEE80211_RADIOTAP_DATA_RETRIES - - number of retries when either IEEE80211_RADIOTAP_RATE or - IEEE80211_RADIOTAP_MCS was used - - * IEEE80211_RADIOTAP_VHT - - VHT mcs and number of streams used in the transmission (only for devices - without own rate control). Also other fields are parsed - - flags field - IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval - - bandwidth field - 1: send using 40MHz channel width - 4: send using 80MHz channel width - 11: send using 160MHz channel width - -The injection code can also skip all other currently defined radiotap fields -facilitating replay of captured radiotap headers directly. - -Here is an example valid radiotap header defining some parameters - - 0x00, 0x00, // <-- radiotap version - 0x0b, 0x00, // <- radiotap header length - 0x04, 0x0c, 0x00, 0x00, // <-- bitmap - 0x6c, // <-- rate - 0x0c, //<-- tx power - 0x01 //<-- antenna - -The ieee80211 header follows immediately afterwards, looking for example like -this: - - 0x08, 0x01, 0x00, 0x00, - 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, - 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, - 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, - 0x10, 0x86 - -Then lastly there is the payload. - -After composing the packet contents, it is sent by send()-ing it to a logical -mac80211 interface that is in Monitor mode. Libpcap can also be used, -(which is easier than doing the work to bind the socket to the right -interface), along the following lines: - - ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf); -... - r = pcap_inject(ppcap, u8aSendBuffer, nLength); - -You can also find a link to a complete inject application here: - -http://wireless.kernel.org/en/users/Documentation/packetspammer - -Andy Green diff --git a/MAINTAINERS b/MAINTAINERS index 956999d2d979..33bfc9e4aead 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10079,7 +10079,7 @@ S: Maintained W: https://wireless.wiki.kernel.org/ T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git -F: Documentation/networking/mac80211-injection.txt +F: Documentation/networking/mac80211-injection.rst F: Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst F: drivers/net/wireless/mac80211_hwsim.[ch] F: include/net/mac80211.h diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c index 82846aca86d9..9849c14694db 100644 --- a/net/mac80211/tx.c +++ b/net/mac80211/tx.c @@ -2144,7 +2144,7 @@ static bool ieee80211_parse_tx_radiotap(struct ieee80211_local *local, /* * Please update the file - * Documentation/networking/mac80211-injection.txt + * Documentation/networking/mac80211-injection.rst * when parsing new fields here. */ -- cgit v1.2.3 From e14fd64dcda5668c5dd5e59421fc5c61ce6d5951 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:00 +0200 Subject: docs: networking: convert mpls-sysctl.txt to ReST - add SPDX header; - add a document title; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/mpls-sysctl.rst | 57 ++++++++++++++++++++++++++++++++ Documentation/networking/mpls-sysctl.txt | 48 --------------------------- 3 files changed, 58 insertions(+), 48 deletions(-) create mode 100644 Documentation/networking/mpls-sysctl.rst delete mode 100644 Documentation/networking/mpls-sysctl.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 81c1834bfb57..a751cda83c3d 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -78,6 +78,7 @@ Contents: lapb-module ltpc mac80211-injection + mpls-sysctl .. only:: subproject and html diff --git a/Documentation/networking/mpls-sysctl.rst b/Documentation/networking/mpls-sysctl.rst new file mode 100644 index 000000000000..0a2ac88404d7 --- /dev/null +++ b/Documentation/networking/mpls-sysctl.rst @@ -0,0 +1,57 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +MPLS Sysfs variables +==================== + +/proc/sys/net/mpls/* Variables: +=============================== + +platform_labels - INTEGER + Number of entries in the platform label table. It is not + possible to configure forwarding for label values equal to or + greater than the number of platform labels. + + A dense utilization of the entries in the platform label table + is possible and expected as the platform labels are locally + allocated. + + If the number of platform label table entries is set to 0 no + label will be recognized by the kernel and mpls forwarding + will be disabled. + + Reducing this value will remove all label routing entries that + no longer fit in the table. + + Possible values: 0 - 1048575 + + Default: 0 + +ip_ttl_propagate - BOOL + Control whether TTL is propagated from the IPv4/IPv6 header to + the MPLS header on imposing labels and propagated from the + MPLS header to the IPv4/IPv6 header on popping the last label. + + If disabled, the MPLS transport network will appear as a + single hop to transit traffic. + + * 0 - disabled / RFC 3443 [Short] Pipe Model + * 1 - enabled / RFC 3443 Uniform Model (default) + +default_ttl - INTEGER + Default TTL value to use for MPLS packets where it cannot be + propagated from an IP header, either because one isn't present + or ip_ttl_propagate has been disabled. + + Possible values: 1 - 255 + + Default: 255 + +conf//input - BOOL + Control whether packets can be input on this interface. + + If disabled, packets will be discarded without further + processing. + + * 0 - disabled (default) + * not 0 - enabled diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt deleted file mode 100644 index 025cc9b96992..000000000000 --- a/Documentation/networking/mpls-sysctl.txt +++ /dev/null @@ -1,48 +0,0 @@ -/proc/sys/net/mpls/* Variables: - -platform_labels - INTEGER - Number of entries in the platform label table. It is not - possible to configure forwarding for label values equal to or - greater than the number of platform labels. - - A dense utilization of the entries in the platform label table - is possible and expected as the platform labels are locally - allocated. - - If the number of platform label table entries is set to 0 no - label will be recognized by the kernel and mpls forwarding - will be disabled. - - Reducing this value will remove all label routing entries that - no longer fit in the table. - - Possible values: 0 - 1048575 - Default: 0 - -ip_ttl_propagate - BOOL - Control whether TTL is propagated from the IPv4/IPv6 header to - the MPLS header on imposing labels and propagated from the - MPLS header to the IPv4/IPv6 header on popping the last label. - - If disabled, the MPLS transport network will appear as a - single hop to transit traffic. - - 0 - disabled / RFC 3443 [Short] Pipe Model - 1 - enabled / RFC 3443 Uniform Model (default) - -default_ttl - INTEGER - Default TTL value to use for MPLS packets where it cannot be - propagated from an IP header, either because one isn't present - or ip_ttl_propagate has been disabled. - - Possible values: 1 - 255 - Default: 255 - -conf//input - BOOL - Control whether packets can be input on this interface. - - If disabled, packets will be discarded without further - processing. - - 0 - disabled (default) - not 0 - enabled -- cgit v1.2.3 From e98aa68223e48164c559803d25484533dfd495ba Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:01 +0200 Subject: docs: networking: convert multiqueue.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - use :field: markup; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/bonding.rst | 2 +- Documentation/networking/index.rst | 1 + Documentation/networking/multiqueue.rst | 78 ++++++++++++++++++++++++++++++++ Documentation/networking/multiqueue.txt | 79 --------------------------------- 4 files changed, 80 insertions(+), 80 deletions(-) create mode 100644 Documentation/networking/multiqueue.rst delete mode 100644 Documentation/networking/multiqueue.txt (limited to 'Documentation') diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index dd49f95d28d3..24168b0d16bd 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -1639,7 +1639,7 @@ can safely be sent over either interface. Such configurations may be achieved using the traffic control utilities inherent in linux. By default the bonding driver is multiqueue aware and 16 queues are created -when the driver initializes (see Documentation/networking/multiqueue.txt +when the driver initializes (see Documentation/networking/multiqueue.rst for details). If more or less queues are desired the module parameter tx_queues can be used to change this value. There is no sysfs parameter available as the allocation is done at module init time. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index a751cda83c3d..492658bf7c0d 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -79,6 +79,7 @@ Contents: ltpc mac80211-injection mpls-sysctl + multiqueue .. only:: subproject and html diff --git a/Documentation/networking/multiqueue.rst b/Documentation/networking/multiqueue.rst new file mode 100644 index 000000000000..0a576166e9dd --- /dev/null +++ b/Documentation/networking/multiqueue.rst @@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================== +HOWTO for multiqueue network device support +=========================================== + +Section 1: Base driver requirements for implementing multiqueue support +======================================================================= + +Intro: Kernel support for multiqueue devices +--------------------------------------------------------- + +Kernel support for multiqueue devices is always present. + +Base drivers are required to use the new alloc_etherdev_mq() or +alloc_netdev_mq() functions to allocate the subqueues for the device. The +underlying kernel API will take care of the allocation and deallocation of +the subqueue memory, as well as netdev configuration of where the queues +exist in memory. + +The base driver will also need to manage the queues as it does the global +netdev->queue_lock today. Therefore base drivers should use the +netif_{start|stop|wake}_subqueue() functions to manage each queue while the +device is still operational. netdev->queue_lock is still used when the device +comes online or when it's completely shut down (unregister_netdev(), etc.). + + +Section 2: Qdisc support for multiqueue devices +=============================================== + +Currently two qdiscs are optimized for multiqueue devices. The first is the +default pfifo_fast qdisc. This qdisc supports one qdisc per hardware queue. +A new round-robin qdisc, sch_multiq also supports multiple hardware queues. The +qdisc is responsible for classifying the skb's and then directing the skb's to +bands and queues based on the value in skb->queue_mapping. Use this field in +the base driver to determine which queue to send the skb to. + +sch_multiq has been added for hardware that wishes to avoid head-of-line +blocking. It will cycle though the bands and verify that the hardware queue +associated with the band is not stopped prior to dequeuing a packet. + +On qdisc load, the number of bands is based on the number of queues on the +hardware. Once the association is made, any skb with skb->queue_mapping set, +will be queued to the band associated with the hardware queue. + + +Section 3: Brief howto using MULTIQ for multiqueue devices +========================================================== + +The userspace command 'tc,' part of the iproute2 package, is used to configure +qdiscs. To add the MULTIQ qdisc to your network device, assuming the device +is called eth0, run the following command:: + + # tc qdisc add dev eth0 root handle 1: multiq + +The qdisc will allocate the number of bands to equal the number of queues that +the device reports, and bring the qdisc online. Assuming eth0 has 4 Tx +queues, the band mapping would look like:: + + band 0 => queue 0 + band 1 => queue 1 + band 2 => queue 2 + band 3 => queue 3 + +Traffic will begin flowing through each queue based on either the simple_tx_hash +function or based on netdev->select_queue() if you have it defined. + +The behavior of tc filters remains the same. However a new tc action, +skbedit, has been added. Assuming you wanted to route all traffic to a +specific host, for example 192.168.0.3, through a specific queue you could use +this action and establish a filter such as:: + + tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \ + match ip dst 192.168.0.3 \ + action skbedit queue_mapping 3 + +:Author: Alexander Duyck +:Original Author: Peter P. Waskiewicz Jr. diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt deleted file mode 100644 index 4caa0e314cc2..000000000000 --- a/Documentation/networking/multiqueue.txt +++ /dev/null @@ -1,79 +0,0 @@ - - HOWTO for multiqueue network device support - =========================================== - -Section 1: Base driver requirements for implementing multiqueue support - -Intro: Kernel support for multiqueue devices ---------------------------------------------------------- - -Kernel support for multiqueue devices is always present. - -Section 1: Base driver requirements for implementing multiqueue support ------------------------------------------------------------------------ - -Base drivers are required to use the new alloc_etherdev_mq() or -alloc_netdev_mq() functions to allocate the subqueues for the device. The -underlying kernel API will take care of the allocation and deallocation of -the subqueue memory, as well as netdev configuration of where the queues -exist in memory. - -The base driver will also need to manage the queues as it does the global -netdev->queue_lock today. Therefore base drivers should use the -netif_{start|stop|wake}_subqueue() functions to manage each queue while the -device is still operational. netdev->queue_lock is still used when the device -comes online or when it's completely shut down (unregister_netdev(), etc.). - - -Section 2: Qdisc support for multiqueue devices - ------------------------------------------------ - -Currently two qdiscs are optimized for multiqueue devices. The first is the -default pfifo_fast qdisc. This qdisc supports one qdisc per hardware queue. -A new round-robin qdisc, sch_multiq also supports multiple hardware queues. The -qdisc is responsible for classifying the skb's and then directing the skb's to -bands and queues based on the value in skb->queue_mapping. Use this field in -the base driver to determine which queue to send the skb to. - -sch_multiq has been added for hardware that wishes to avoid head-of-line -blocking. It will cycle though the bands and verify that the hardware queue -associated with the band is not stopped prior to dequeuing a packet. - -On qdisc load, the number of bands is based on the number of queues on the -hardware. Once the association is made, any skb with skb->queue_mapping set, -will be queued to the band associated with the hardware queue. - - -Section 3: Brief howto using MULTIQ for multiqueue devices ---------------------------------------------------------------- - -The userspace command 'tc,' part of the iproute2 package, is used to configure -qdiscs. To add the MULTIQ qdisc to your network device, assuming the device -is called eth0, run the following command: - -# tc qdisc add dev eth0 root handle 1: multiq - -The qdisc will allocate the number of bands to equal the number of queues that -the device reports, and bring the qdisc online. Assuming eth0 has 4 Tx -queues, the band mapping would look like: - -band 0 => queue 0 -band 1 => queue 1 -band 2 => queue 2 -band 3 => queue 3 - -Traffic will begin flowing through each queue based on either the simple_tx_hash -function or based on netdev->select_queue() if you have it defined. - -The behavior of tc filters remains the same. However a new tc action, -skbedit, has been added. Assuming you wanted to route all traffic to a -specific host, for example 192.168.0.3, through a specific queue you could use -this action and establish a filter such as: - -tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \ - match ip dst 192.168.0.3 \ - action skbedit queue_mapping 3 - -Author: Alexander Duyck -Original Author: Peter P. Waskiewicz Jr. -- cgit v1.2.3 From d9d6ef25ecab3e10966981bc58d014576da74272 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:02 +0200 Subject: docs: networking: convert netconsole.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - mark tables as such; - add notes markups; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/admin-guide/serial-console.rst | 2 +- Documentation/networking/index.rst | 1 + Documentation/networking/netconsole.rst | 239 ++++++++++++++++++++++++ Documentation/networking/netconsole.txt | 210 --------------------- drivers/net/Kconfig | 4 +- drivers/net/ethernet/toshiba/ps3_gelic_net.c | 2 +- drivers/net/ethernet/toshiba/spider_net.c | 2 +- 8 files changed, 246 insertions(+), 216 deletions(-) create mode 100644 Documentation/networking/netconsole.rst delete mode 100644 Documentation/networking/netconsole.txt (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index e43f2e1f2958..398c34804bb8 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -638,7 +638,7 @@ See Documentation/admin-guide/serial-console.rst for more information. See - Documentation/networking/netconsole.txt for an + Documentation/networking/netconsole.rst for an alternative. uart[8250],io,[,options] diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst index a8d1e36b627a..58b32832e50a 100644 --- a/Documentation/admin-guide/serial-console.rst +++ b/Documentation/admin-guide/serial-console.rst @@ -54,7 +54,7 @@ You will need to create a new device to use ``/dev/console``. The official ``/dev/console`` is now character device 5,1. (You can also use a network device as a console. See -``Documentation/networking/netconsole.txt`` for information on that.) +``Documentation/networking/netconsole.rst`` for information on that.) Here's an example that will use ``/dev/ttyS1`` (COM2) as the console. Replace the sample values as needed. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 492658bf7c0d..e58f872d401d 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -80,6 +80,7 @@ Contents: mac80211-injection mpls-sysctl multiqueue + netconsole .. only:: subproject and html diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst new file mode 100644 index 000000000000..1f5c4a04027c --- /dev/null +++ b/Documentation/networking/netconsole.rst @@ -0,0 +1,239 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========== +Netconsole +========== + + +started by Ingo Molnar , 2001.09.17 + +2.6 port and netpoll api by Matt Mackall , Sep 9 2003 + +IPv6 support by Cong Wang , Jan 1 2013 + +Extended console support by Tejun Heo , May 1 2015 + +Please send bug reports to Matt Mackall +Satyam Sharma , and Cong Wang + +Introduction: +============= + +This module logs kernel printk messages over UDP allowing debugging of +problem where disk logging fails and serial consoles are impractical. + +It can be used either built-in or as a module. As a built-in, +netconsole initializes immediately after NIC cards and will bring up +the specified interface as soon as possible. While this doesn't allow +capture of early kernel panics, it does capture most of the boot +process. + +Sender and receiver configuration: +================================== + +It takes a string configuration parameter "netconsole" in the +following format:: + + netconsole=[+][src-port]@[src-ip]/[],[tgt-port]@/[tgt-macaddr] + + where + + if present, enable extended console support + src-port source for UDP packets (defaults to 6665) + src-ip source IP to use (interface address) + dev network interface (eth0) + tgt-port port for logging agent (6666) + tgt-ip IP address for logging agent + tgt-macaddr ethernet MAC address for logging agent (broadcast) + +Examples:: + + linux netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc + +or:: + + insmod netconsole netconsole=@/,@10.0.0.2/ + +or using IPv6:: + + insmod netconsole netconsole=@/,@fd00:1:2:3::1/ + +It also supports logging to multiple remote agents by specifying +parameters for the multiple agents separated by semicolons and the +complete string enclosed in "quotes", thusly:: + + modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/" + +Built-in netconsole starts immediately after the TCP stack is +initialized and attempts to bring up the supplied dev at the supplied +address. + +The remote host has several options to receive the kernel messages, +for example: + +1) syslogd + +2) netcat + + On distributions using a BSD-based netcat version (e.g. Fedora, + openSUSE and Ubuntu) the listening port must be specified without + the -p switch:: + + nc -u -l -p ' / 'nc -u -l + + or:: + + netcat -u -l -p ' / 'netcat -u -l + +3) socat + +:: + + socat udp-recv: - + +Dynamic reconfiguration: +======================== + +Dynamic reconfigurability is a useful addition to netconsole that enables +remote logging targets to be dynamically added, removed, or have their +parameters reconfigured at runtime from a configfs-based userspace interface. +[ Note that the parameters of netconsole targets that were specified/created +from the boot/module option are not exposed via this interface, and hence +cannot be modified dynamically. ] + +To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the +netconsole module (or kernel, if netconsole is built-in). + +Some examples follow (where configfs is mounted at the /sys/kernel/config +mountpoint). + +To add a remote logging target (target names can be arbitrary):: + + cd /sys/kernel/config/netconsole/ + mkdir target1 + +Note that newly created targets have default parameter values (as mentioned +above) and are disabled by default -- they must first be enabled by writing +"1" to the "enabled" attribute (usually after setting parameters accordingly) +as described below. + +To remove a target:: + + rmdir /sys/kernel/config/netconsole/othertarget/ + +The interface exposes these parameters of a netconsole target to userspace: + + ============== ================================= ============ + enabled Is this target currently enabled? (read-write) + extended Extended mode enabled (read-write) + dev_name Local network interface name (read-write) + local_port Source UDP port to use (read-write) + remote_port Remote agent's UDP port (read-write) + local_ip Source IP address to use (read-write) + remote_ip Remote agent's IP address (read-write) + local_mac Local interface's MAC address (read-only) + remote_mac Remote agent's MAC address (read-write) + ============== ================================= ============ + +The "enabled" attribute is also used to control whether the parameters of +a target can be updated or not -- you can modify the parameters of only +disabled targets (i.e. if "enabled" is 0). + +To update a target's parameters:: + + cat enabled # check if enabled is 1 + echo 0 > enabled # disable the target (if required) + echo eth2 > dev_name # set local interface + echo 10.0.0.4 > remote_ip # update some parameter + echo cb:a9:87:65:43:21 > remote_mac # update more parameters + echo 1 > enabled # enable target again + +You can also update the local interface dynamically. This is especially +useful if you want to use interfaces that have newly come up (and may not +have existed when netconsole was loaded / initialized). + +Extended console: +================= + +If '+' is prefixed to the configuration line or "extended" config file +is set to 1, extended console support is enabled. An example boot +param follows:: + + linux netconsole=+4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc + +Log messages are transmitted with extended metadata header in the +following format which is the same as /dev/kmsg:: + + ,,,; + +Non printable characters in are escaped using "\xff" +notation. If the message contains optional dictionary, verbatim +newline is used as the delimeter. + +If a message doesn't fit in certain number of bytes (currently 1000), +the message is split into multiple fragments by netconsole. These +fragments are transmitted with "ncfrag" header field added:: + + ncfrag=/ + +For example, assuming a lot smaller chunk size, a message "the first +chunk, the 2nd chunk." may be split as follows:: + + 6,416,1758426,-,ncfrag=0/31;the first chunk, + 6,416,1758426,-,ncfrag=16/31; the 2nd chunk. + +Miscellaneous notes: +==================== + +.. Warning:: + + the default target ethernet setting uses the broadcast + ethernet address to send packets, which can cause increased load on + other systems on the same ethernet segment. + +.. Tip:: + + some LAN switches may be configured to suppress ethernet broadcasts + so it is advised to explicitly specify the remote agents' MAC addresses + from the config parameters passed to netconsole. + +.. Tip:: + + to find out the MAC address of, say, 10.0.0.2, you may try using:: + + ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2 + +.. Tip:: + + in case the remote logging agent is on a separate LAN subnet than + the sender, it is suggested to try specifying the MAC address of the + default gateway (you may use /sbin/route -n to find it out) as the + remote MAC address instead. + +.. note:: + + the network device (eth1 in the above case) can run any kind + of other network traffic, netconsole is not intrusive. Netconsole + might cause slight delays in other traffic if the volume of kernel + messages is high, but should have no other impact. + +.. note:: + + if you find that the remote logging agent is not receiving or + printing all messages from the sender, it is likely that you have set + the "console_loglevel" parameter (on the sender) to only send high + priority messages to the console. You can change this at runtime using:: + + dmesg -n 8 + + or by specifying "debug" on the kernel command line at boot, to send + all kernel messages to the console. A specific value for this parameter + can also be set using the "loglevel" kernel boot option. See the + dmesg(8) man page and Documentation/admin-guide/kernel-parameters.rst + for details. + +Netconsole was designed to be as instantaneous as possible, to +enable the logging of even the most critical kernel bugs. It works +from IRQ contexts as well, and does not enable interrupts while +sending packets. Due to these unique needs, configuration cannot +be more automatic, and some fundamental limitations will remain: +only IP networks, UDP packets and ethernet devices are supported. diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.txt deleted file mode 100644 index 296ea00fd3eb..000000000000 --- a/Documentation/networking/netconsole.txt +++ /dev/null @@ -1,210 +0,0 @@ - -started by Ingo Molnar , 2001.09.17 -2.6 port and netpoll api by Matt Mackall , Sep 9 2003 -IPv6 support by Cong Wang , Jan 1 2013 -Extended console support by Tejun Heo , May 1 2015 - -Please send bug reports to Matt Mackall -Satyam Sharma , and Cong Wang - -Introduction: -============= - -This module logs kernel printk messages over UDP allowing debugging of -problem where disk logging fails and serial consoles are impractical. - -It can be used either built-in or as a module. As a built-in, -netconsole initializes immediately after NIC cards and will bring up -the specified interface as soon as possible. While this doesn't allow -capture of early kernel panics, it does capture most of the boot -process. - -Sender and receiver configuration: -================================== - -It takes a string configuration parameter "netconsole" in the -following format: - - netconsole=[+][src-port]@[src-ip]/[],[tgt-port]@/[tgt-macaddr] - - where - + if present, enable extended console support - src-port source for UDP packets (defaults to 6665) - src-ip source IP to use (interface address) - dev network interface (eth0) - tgt-port port for logging agent (6666) - tgt-ip IP address for logging agent - tgt-macaddr ethernet MAC address for logging agent (broadcast) - -Examples: - - linux netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc - - or - - insmod netconsole netconsole=@/,@10.0.0.2/ - - or using IPv6 - - insmod netconsole netconsole=@/,@fd00:1:2:3::1/ - -It also supports logging to multiple remote agents by specifying -parameters for the multiple agents separated by semicolons and the -complete string enclosed in "quotes", thusly: - - modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/" - -Built-in netconsole starts immediately after the TCP stack is -initialized and attempts to bring up the supplied dev at the supplied -address. - -The remote host has several options to receive the kernel messages, -for example: - -1) syslogd - -2) netcat - - On distributions using a BSD-based netcat version (e.g. Fedora, - openSUSE and Ubuntu) the listening port must be specified without - the -p switch: - - 'nc -u -l -p ' / 'nc -u -l ' or - 'netcat -u -l -p ' / 'netcat -u -l ' - -3) socat - - 'socat udp-recv: -' - -Dynamic reconfiguration: -======================== - -Dynamic reconfigurability is a useful addition to netconsole that enables -remote logging targets to be dynamically added, removed, or have their -parameters reconfigured at runtime from a configfs-based userspace interface. -[ Note that the parameters of netconsole targets that were specified/created -from the boot/module option are not exposed via this interface, and hence -cannot be modified dynamically. ] - -To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the -netconsole module (or kernel, if netconsole is built-in). - -Some examples follow (where configfs is mounted at the /sys/kernel/config -mountpoint). - -To add a remote logging target (target names can be arbitrary): - - cd /sys/kernel/config/netconsole/ - mkdir target1 - -Note that newly created targets have default parameter values (as mentioned -above) and are disabled by default -- they must first be enabled by writing -"1" to the "enabled" attribute (usually after setting parameters accordingly) -as described below. - -To remove a target: - - rmdir /sys/kernel/config/netconsole/othertarget/ - -The interface exposes these parameters of a netconsole target to userspace: - - enabled Is this target currently enabled? (read-write) - extended Extended mode enabled (read-write) - dev_name Local network interface name (read-write) - local_port Source UDP port to use (read-write) - remote_port Remote agent's UDP port (read-write) - local_ip Source IP address to use (read-write) - remote_ip Remote agent's IP address (read-write) - local_mac Local interface's MAC address (read-only) - remote_mac Remote agent's MAC address (read-write) - -The "enabled" attribute is also used to control whether the parameters of -a target can be updated or not -- you can modify the parameters of only -disabled targets (i.e. if "enabled" is 0). - -To update a target's parameters: - - cat enabled # check if enabled is 1 - echo 0 > enabled # disable the target (if required) - echo eth2 > dev_name # set local interface - echo 10.0.0.4 > remote_ip # update some parameter - echo cb:a9:87:65:43:21 > remote_mac # update more parameters - echo 1 > enabled # enable target again - -You can also update the local interface dynamically. This is especially -useful if you want to use interfaces that have newly come up (and may not -have existed when netconsole was loaded / initialized). - -Extended console: -================= - -If '+' is prefixed to the configuration line or "extended" config file -is set to 1, extended console support is enabled. An example boot -param follows. - - linux netconsole=+4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc - -Log messages are transmitted with extended metadata header in the -following format which is the same as /dev/kmsg. - - ,,,; - -Non printable characters in are escaped using "\xff" -notation. If the message contains optional dictionary, verbatim -newline is used as the delimeter. - -If a message doesn't fit in certain number of bytes (currently 1000), -the message is split into multiple fragments by netconsole. These -fragments are transmitted with "ncfrag" header field added. - - ncfrag=/ - -For example, assuming a lot smaller chunk size, a message "the first -chunk, the 2nd chunk." may be split as follows. - - 6,416,1758426,-,ncfrag=0/31;the first chunk, - 6,416,1758426,-,ncfrag=16/31; the 2nd chunk. - -Miscellaneous notes: -==================== - -WARNING: the default target ethernet setting uses the broadcast -ethernet address to send packets, which can cause increased load on -other systems on the same ethernet segment. - -TIP: some LAN switches may be configured to suppress ethernet broadcasts -so it is advised to explicitly specify the remote agents' MAC addresses -from the config parameters passed to netconsole. - -TIP: to find out the MAC address of, say, 10.0.0.2, you may try using: - - ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2 - -TIP: in case the remote logging agent is on a separate LAN subnet than -the sender, it is suggested to try specifying the MAC address of the -default gateway (you may use /sbin/route -n to find it out) as the -remote MAC address instead. - -NOTE: the network device (eth1 in the above case) can run any kind -of other network traffic, netconsole is not intrusive. Netconsole -might cause slight delays in other traffic if the volume of kernel -messages is high, but should have no other impact. - -NOTE: if you find that the remote logging agent is not receiving or -printing all messages from the sender, it is likely that you have set -the "console_loglevel" parameter (on the sender) to only send high -priority messages to the console. You can change this at runtime using: - - dmesg -n 8 - -or by specifying "debug" on the kernel command line at boot, to send -all kernel messages to the console. A specific value for this parameter -can also be set using the "loglevel" kernel boot option. See the -dmesg(8) man page and Documentation/admin-guide/kernel-parameters.rst for details. - -Netconsole was designed to be as instantaneous as possible, to -enable the logging of even the most critical kernel bugs. It works -from IRQ contexts as well, and does not enable interrupts while -sending packets. Due to these unique needs, configuration cannot -be more automatic, and some fundamental limitations will remain: -only IP networks, UDP packets and ethernet devices are supported. diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index c822f4a6d166..ad64be98330f 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -302,7 +302,7 @@ config NETCONSOLE tristate "Network console logging support" ---help--- If you want to log kernel messages over the network, enable this. - See for details. + See for details. config NETCONSOLE_DYNAMIC bool "Dynamic reconfiguration of logging targets" @@ -312,7 +312,7 @@ config NETCONSOLE_DYNAMIC This option enables the ability to dynamically reconfigure target parameters (interface, IP addresses, port numbers, MAC addresses) at runtime through a userspace interface exported using configfs. - See for details. + See for details. config NETPOLL def_bool NETCONSOLE diff --git a/drivers/net/ethernet/toshiba/ps3_gelic_net.c b/drivers/net/ethernet/toshiba/ps3_gelic_net.c index 070dd6fa9401..310e6839c6e5 100644 --- a/drivers/net/ethernet/toshiba/ps3_gelic_net.c +++ b/drivers/net/ethernet/toshiba/ps3_gelic_net.c @@ -1150,7 +1150,7 @@ static irqreturn_t gelic_card_interrupt(int irq, void *ptr) * gelic_net_poll_controller - artificial interrupt for netconsole etc. * @netdev: interface device structure * - * see Documentation/networking/netconsole.txt + * see Documentation/networking/netconsole.rst */ void gelic_net_poll_controller(struct net_device *netdev) { diff --git a/drivers/net/ethernet/toshiba/spider_net.c b/drivers/net/ethernet/toshiba/spider_net.c index 6576271642c1..3902b3aeb0c2 100644 --- a/drivers/net/ethernet/toshiba/spider_net.c +++ b/drivers/net/ethernet/toshiba/spider_net.c @@ -1615,7 +1615,7 @@ spider_net_interrupt(int irq, void *ptr) * spider_net_poll_controller - artificial interrupt for netconsole etc. * @netdev: interface device structure * - * see Documentation/networking/netconsole.txt + * see Documentation/networking/netconsole.rst */ static void spider_net_poll_controller(struct net_device *netdev) -- cgit v1.2.3 From ea5bacaa2cec6967ed337f4d0ad6034123ca737b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:03 +0200 Subject: docs: networking: convert netdev-features.txt to ReST Not much to be done here: - add SPDX header; - adjust titles and chapters, adding proper markups; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/checksum-offloads.rst | 2 +- Documentation/networking/index.rst | 1 + Documentation/networking/netdev-features.rst | 184 +++++++++++++++++++++++++ Documentation/networking/netdev-features.txt | 181 ------------------------ include/linux/netdev_features.h | 2 +- 5 files changed, 187 insertions(+), 183 deletions(-) create mode 100644 Documentation/networking/netdev-features.rst delete mode 100644 Documentation/networking/netdev-features.txt (limited to 'Documentation') diff --git a/Documentation/networking/checksum-offloads.rst b/Documentation/networking/checksum-offloads.rst index 905c8a84b103..69b23cf6879e 100644 --- a/Documentation/networking/checksum-offloads.rst +++ b/Documentation/networking/checksum-offloads.rst @@ -59,7 +59,7 @@ recomputed for each resulting segment. See the skbuff.h comment (section 'E') for more details. A driver declares its offload capabilities in netdev->hw_features; see -Documentation/networking/netdev-features.txt for more. Note that a device +Documentation/networking/netdev-features.rst for more. Note that a device which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start and csum_offset given in the SKB; if it tries to deduce these itself in hardware (as some NICs do) the driver should check that the values in the SKB match diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e58f872d401d..4c6aa3db97d4 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -81,6 +81,7 @@ Contents: mpls-sysctl multiqueue netconsole + netdev-features .. only:: subproject and html diff --git a/Documentation/networking/netdev-features.rst b/Documentation/networking/netdev-features.rst new file mode 100644 index 000000000000..a2d7d7160e39 --- /dev/null +++ b/Documentation/networking/netdev-features.rst @@ -0,0 +1,184 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================================== +Netdev features mess and how to get out from it alive +===================================================== + +Author: + MichaÅ‚ MirosÅ‚aw + + + +Part I: Feature sets +==================== + +Long gone are the days when a network card would just take and give packets +verbatim. Today's devices add multiple features and bugs (read: offloads) +that relieve an OS of various tasks like generating and checking checksums, +splitting packets, classifying them. Those capabilities and their state +are commonly referred to as netdev features in Linux kernel world. + +There are currently three sets of features relevant to the driver, and +one used internally by network core: + + 1. netdev->hw_features set contains features whose state may possibly + be changed (enabled or disabled) for a particular device by user's + request. This set should be initialized in ndo_init callback and not + changed later. + + 2. netdev->features set contains features which are currently enabled + for a device. This should be changed only by network core or in + error paths of ndo_set_features callback. + + 3. netdev->vlan_features set contains features whose state is inherited + by child VLAN devices (limits netdev->features set). This is currently + used for all VLAN devices whether tags are stripped or inserted in + hardware or software. + + 4. netdev->wanted_features set contains feature set requested by user. + This set is filtered by ndo_fix_features callback whenever it or + some device-specific conditions change. This set is internal to + networking core and should not be referenced in drivers. + + + +Part II: Controlling enabled features +===================================== + +When current feature set (netdev->features) is to be changed, new set +is calculated and filtered by calling ndo_fix_features callback +and netdev_fix_features(). If the resulting set differs from current +set, it is passed to ndo_set_features callback and (if the callback +returns success) replaces value stored in netdev->features. +NETDEV_FEAT_CHANGE notification is issued after that whenever current +set might have changed. + +The following events trigger recalculation: + 1. device's registration, after ndo_init returned success + 2. user requested changes in features state + 3. netdev_update_features() is called + +ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks +are treated as always returning success. + +A driver that wants to trigger recalculation must do so by calling +netdev_update_features() while holding rtnl_lock. This should not be done +from ndo_*_features callbacks. netdev->features should not be modified by +driver except by means of ndo_fix_features callback. + + + +Part III: Implementation hints +============================== + + * ndo_fix_features: + +All dependencies between features should be resolved here. The resulting +set can be reduced further by networking core imposed limitations (as coded +in netdev_fix_features()). For this reason it is safer to disable a feature +when its dependencies are not met instead of forcing the dependency on. + +This callback should not modify hardware nor driver state (should be +stateless). It can be called multiple times between successive +ndo_set_features calls. + +Callback must not alter features contained in NETIF_F_SOFT_FEATURES or +NETIF_F_NEVER_CHANGE sets. The exception is NETIF_F_VLAN_CHALLENGED but +care must be taken as the change won't affect already configured VLANs. + + * ndo_set_features: + +Hardware should be reconfigured to match passed feature set. The set +should not be altered unless some error condition happens that can't +be reliably detected in ndo_fix_features. In this case, the callback +should update netdev->features to match resulting hardware state. +Errors returned are not (and cannot be) propagated anywhere except dmesg. +(Note: successful return is zero, >0 means silent error.) + + + +Part IV: Features +================= + +For current list of features, see include/linux/netdev_features.h. +This section describes semantics of some of them. + + * Transmit checksumming + +For complete description, see comments near the top of include/linux/skbuff.h. + +Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. +It means that device can fill TCP/UDP-like checksum anywhere in the packets +whatever headers there might be. + + * Transmit TCP segmentation offload + +NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit +set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). + + * Transmit UDP segmentation offload + +NETIF_F_GSO_UDP_L4 accepts a single UDP header with a payload that exceeds +gso_size. On segmentation, it segments the payload on gso_size boundaries and +replicates the network and UDP headers (fixing up the last one if less than +gso_size). + + * Transmit DMA from high memory + +On platforms where this is relevant, NETIF_F_HIGHDMA signals that +ndo_start_xmit can handle skbs with frags in high memory. + + * Transmit scatter-gather + +Those features say that ndo_start_xmit can handle fragmented skbs: +NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST --- +chained skbs (skb->next/prev list). + + * Software features + +Features contained in NETIF_F_SOFT_FEATURES are features of networking +stack. Driver should not change behaviour based on them. + + * LLTX driver (deprecated for hardware drivers) + +NETIF_F_LLTX is meant to be used by drivers that don't need locking at all, +e.g. software tunnels. + +This is also used in a few legacy drivers that implement their +own locking, don't use it for new (hardware) drivers. + + * netns-local device + +NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between +network namespaces (e.g. loopback). + +Don't use it in drivers. + + * VLAN challenged + +NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN +headers. Some drivers set this because the cards can't handle the bigger MTU. +[FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU +VLANs. This may be not useful, though.] + +* rx-fcs + +This requests that the NIC append the Ethernet Frame Checksum (FCS) +to the end of the skb data. This allows sniffers and other tools to +read the CRC recorded by the NIC on receipt of the packet. + +* rx-all + +This requests that the NIC receive all possible frames, including errored +frames (such as bad FCS, etc). This can be helpful when sniffing a link with +bad packets on it. Some NICs may receive more packets if also put into normal +PROMISC mode. + +* rx-gro-hw + +This requests that the NIC enables Hardware GRO (generic receive offload). +Hardware GRO is basically the exact reverse of TSO, and is generally +stricter than Hardware LRO. A packet stream merged by Hardware GRO must +be re-segmentable by GSO or TSO back to the exact original packet stream. +Hardware GRO is dependent on RXCSUM since every packet successfully merged +by hardware must also have the checksum verified by hardware. diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt deleted file mode 100644 index 58dd1c1e3c65..000000000000 --- a/Documentation/networking/netdev-features.txt +++ /dev/null @@ -1,181 +0,0 @@ -Netdev features mess and how to get out from it alive -===================================================== - -Author: - MichaÅ‚ MirosÅ‚aw - - - - Part I: Feature sets -====================== - -Long gone are the days when a network card would just take and give packets -verbatim. Today's devices add multiple features and bugs (read: offloads) -that relieve an OS of various tasks like generating and checking checksums, -splitting packets, classifying them. Those capabilities and their state -are commonly referred to as netdev features in Linux kernel world. - -There are currently three sets of features relevant to the driver, and -one used internally by network core: - - 1. netdev->hw_features set contains features whose state may possibly - be changed (enabled or disabled) for a particular device by user's - request. This set should be initialized in ndo_init callback and not - changed later. - - 2. netdev->features set contains features which are currently enabled - for a device. This should be changed only by network core or in - error paths of ndo_set_features callback. - - 3. netdev->vlan_features set contains features whose state is inherited - by child VLAN devices (limits netdev->features set). This is currently - used for all VLAN devices whether tags are stripped or inserted in - hardware or software. - - 4. netdev->wanted_features set contains feature set requested by user. - This set is filtered by ndo_fix_features callback whenever it or - some device-specific conditions change. This set is internal to - networking core and should not be referenced in drivers. - - - - Part II: Controlling enabled features -======================================= - -When current feature set (netdev->features) is to be changed, new set -is calculated and filtered by calling ndo_fix_features callback -and netdev_fix_features(). If the resulting set differs from current -set, it is passed to ndo_set_features callback and (if the callback -returns success) replaces value stored in netdev->features. -NETDEV_FEAT_CHANGE notification is issued after that whenever current -set might have changed. - -The following events trigger recalculation: - 1. device's registration, after ndo_init returned success - 2. user requested changes in features state - 3. netdev_update_features() is called - -ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks -are treated as always returning success. - -A driver that wants to trigger recalculation must do so by calling -netdev_update_features() while holding rtnl_lock. This should not be done -from ndo_*_features callbacks. netdev->features should not be modified by -driver except by means of ndo_fix_features callback. - - - - Part III: Implementation hints -================================ - - * ndo_fix_features: - -All dependencies between features should be resolved here. The resulting -set can be reduced further by networking core imposed limitations (as coded -in netdev_fix_features()). For this reason it is safer to disable a feature -when its dependencies are not met instead of forcing the dependency on. - -This callback should not modify hardware nor driver state (should be -stateless). It can be called multiple times between successive -ndo_set_features calls. - -Callback must not alter features contained in NETIF_F_SOFT_FEATURES or -NETIF_F_NEVER_CHANGE sets. The exception is NETIF_F_VLAN_CHALLENGED but -care must be taken as the change won't affect already configured VLANs. - - * ndo_set_features: - -Hardware should be reconfigured to match passed feature set. The set -should not be altered unless some error condition happens that can't -be reliably detected in ndo_fix_features. In this case, the callback -should update netdev->features to match resulting hardware state. -Errors returned are not (and cannot be) propagated anywhere except dmesg. -(Note: successful return is zero, >0 means silent error.) - - - - Part IV: Features -=================== - -For current list of features, see include/linux/netdev_features.h. -This section describes semantics of some of them. - - * Transmit checksumming - -For complete description, see comments near the top of include/linux/skbuff.h. - -Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. -It means that device can fill TCP/UDP-like checksum anywhere in the packets -whatever headers there might be. - - * Transmit TCP segmentation offload - -NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit -set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). - - * Transmit UDP segmentation offload - -NETIF_F_GSO_UDP_L4 accepts a single UDP header with a payload that exceeds -gso_size. On segmentation, it segments the payload on gso_size boundaries and -replicates the network and UDP headers (fixing up the last one if less than -gso_size). - - * Transmit DMA from high memory - -On platforms where this is relevant, NETIF_F_HIGHDMA signals that -ndo_start_xmit can handle skbs with frags in high memory. - - * Transmit scatter-gather - -Those features say that ndo_start_xmit can handle fragmented skbs: -NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST --- -chained skbs (skb->next/prev list). - - * Software features - -Features contained in NETIF_F_SOFT_FEATURES are features of networking -stack. Driver should not change behaviour based on them. - - * LLTX driver (deprecated for hardware drivers) - -NETIF_F_LLTX is meant to be used by drivers that don't need locking at all, -e.g. software tunnels. - -This is also used in a few legacy drivers that implement their -own locking, don't use it for new (hardware) drivers. - - * netns-local device - -NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between -network namespaces (e.g. loopback). - -Don't use it in drivers. - - * VLAN challenged - -NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN -headers. Some drivers set this because the cards can't handle the bigger MTU. -[FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU -VLANs. This may be not useful, though.] - -* rx-fcs - -This requests that the NIC append the Ethernet Frame Checksum (FCS) -to the end of the skb data. This allows sniffers and other tools to -read the CRC recorded by the NIC on receipt of the packet. - -* rx-all - -This requests that the NIC receive all possible frames, including errored -frames (such as bad FCS, etc). This can be helpful when sniffing a link with -bad packets on it. Some NICs may receive more packets if also put into normal -PROMISC mode. - -* rx-gro-hw - -This requests that the NIC enables Hardware GRO (generic receive offload). -Hardware GRO is basically the exact reverse of TSO, and is generally -stricter than Hardware LRO. A packet stream merged by Hardware GRO must -be re-segmentable by GSO or TSO back to the exact original packet stream. -Hardware GRO is dependent on RXCSUM since every packet successfully merged -by hardware must also have the checksum verified by hardware. diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h index 9d53c5ad272c..2cc3cf80b49a 100644 --- a/include/linux/netdev_features.h +++ b/include/linux/netdev_features.h @@ -89,7 +89,7 @@ enum { * Add your fresh new feature above and remember to update * netdev_features_strings[] in net/core/ethtool.c and maybe * some feature mask #defines below. Please also describe it - * in Documentation/networking/netdev-features.txt. + * in Documentation/networking/netdev-features.rst. */ /**/NETDEV_FEATURE_COUNT -- cgit v1.2.3 From 482a4360c56a1ecb5ea54c00db647b0012d787cf Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:04 +0200 Subject: docs: networking: convert netdevices.txt to ReST - add SPDX header; - adjust title markup; - mark lists as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/can.rst | 2 +- Documentation/networking/index.rst | 1 + Documentation/networking/netdevices.rst | 111 ++++++++++++++++++++++++++++++++ Documentation/networking/netdevices.txt | 104 ------------------------------ 4 files changed, 113 insertions(+), 105 deletions(-) create mode 100644 Documentation/networking/netdevices.rst delete mode 100644 Documentation/networking/netdevices.txt (limited to 'Documentation') diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst index 2fd0b51a8c52..ff05cbd05e0d 100644 --- a/Documentation/networking/can.rst +++ b/Documentation/networking/can.rst @@ -1058,7 +1058,7 @@ drivers you mainly have to deal with: - TX: Put the CAN frame from the socket buffer to the CAN controller. - RX: Put the CAN frame from the CAN controller to the socket buffer. -See e.g. at Documentation/networking/netdevices.txt . The differences +See e.g. at Documentation/networking/netdevices.rst . The differences for writing CAN network device driver are described below: diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 4c6aa3db97d4..5a320553ffba 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -82,6 +82,7 @@ Contents: multiqueue netconsole netdev-features + netdevices .. only:: subproject and html diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst new file mode 100644 index 000000000000..5a85fcc80c76 --- /dev/null +++ b/Documentation/networking/netdevices.rst @@ -0,0 +1,111 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +Network Devices, the Kernel, and You! +===================================== + + +Introduction +============ +The following is a random collection of documentation regarding +network devices. + +struct net_device allocation rules +================================== +Network device structures need to persist even after module is unloaded and +must be allocated with alloc_netdev_mqs() and friends. +If device has registered successfully, it will be freed on last use +by free_netdev(). This is required to handle the pathologic case cleanly +(example: rmmod mydriver features this will be + called without holding netif_tx_lock. In this case the driver + has to lock by itself when needed. + The locking there should also properly protect against + set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated. + Don't use it for new drivers. + + Context: Process with BHs disabled or BH (timer), + will be called with interrupts disabled by netconsole. + + Return codes: + + * NETDEV_TX_OK everything ok. + * NETDEV_TX_BUSY Cannot transmit packet, try later + Usually a bug, means queue start/stop flow control is broken in + the driver. Note: the driver must NOT put the skb in its DMA ring. + +ndo_tx_timeout: + Synchronization: netif_tx_lock spinlock; all TX queues frozen. + Context: BHs disabled + Notes: netif_queue_stopped() is guaranteed true + +ndo_set_rx_mode: + Synchronization: netif_addr_lock spinlock. + Context: BHs disabled + +struct napi_struct synchronization rules +======================================== +napi->poll: + Synchronization: + NAPI_STATE_SCHED bit in napi->state. Device + driver's ndo_stop method will invoke napi_disable() on + all NAPI instances which will do a sleeping poll on the + NAPI_STATE_SCHED napi->state bit, waiting for all pending + NAPI activity to cease. + + Context: + softirq + will be called with interrupts disabled by netconsole. diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt deleted file mode 100644 index 7fec2061a334..000000000000 --- a/Documentation/networking/netdevices.txt +++ /dev/null @@ -1,104 +0,0 @@ - -Network Devices, the Kernel, and You! - - -Introduction -============ -The following is a random collection of documentation regarding -network devices. - -struct net_device allocation rules -================================== -Network device structures need to persist even after module is unloaded and -must be allocated with alloc_netdev_mqs() and friends. -If device has registered successfully, it will be freed on last use -by free_netdev(). This is required to handle the pathologic case cleanly -(example: rmmod mydriver features this will be - called without holding netif_tx_lock. In this case the driver - has to lock by itself when needed. - The locking there should also properly protect against - set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated. - Don't use it for new drivers. - - Context: Process with BHs disabled or BH (timer), - will be called with interrupts disabled by netconsole. - - Return codes: - o NETDEV_TX_OK everything ok. - o NETDEV_TX_BUSY Cannot transmit packet, try later - Usually a bug, means queue start/stop flow control is broken in - the driver. Note: the driver must NOT put the skb in its DMA ring. - -ndo_tx_timeout: - Synchronization: netif_tx_lock spinlock; all TX queues frozen. - Context: BHs disabled - Notes: netif_queue_stopped() is guaranteed true - -ndo_set_rx_mode: - Synchronization: netif_addr_lock spinlock. - Context: BHs disabled - -struct napi_struct synchronization rules -======================================== -napi->poll: - Synchronization: NAPI_STATE_SCHED bit in napi->state. Device - driver's ndo_stop method will invoke napi_disable() on - all NAPI instances which will do a sleeping poll on the - NAPI_STATE_SCHED napi->state bit, waiting for all pending - NAPI activity to cease. - Context: softirq - will be called with interrupts disabled by netconsole. -- cgit v1.2.3 From 0191533087a3cfbe02a0bde9939fa978cb9761fd Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:05 +0200 Subject: docs: networking: convert netfilter-sysctl.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - add a chapter markup; - mark tables as such; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/netfilter-sysctl.rst | 17 +++++++++++++++++ Documentation/networking/netfilter-sysctl.txt | 10 ---------- 3 files changed, 18 insertions(+), 10 deletions(-) create mode 100644 Documentation/networking/netfilter-sysctl.rst delete mode 100644 Documentation/networking/netfilter-sysctl.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 5a320553ffba..1ae0cbef8c04 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -83,6 +83,7 @@ Contents: netconsole netdev-features netdevices + netfilter-sysctl .. only:: subproject and html diff --git a/Documentation/networking/netfilter-sysctl.rst b/Documentation/networking/netfilter-sysctl.rst new file mode 100644 index 000000000000..beb6d7b275d4 --- /dev/null +++ b/Documentation/networking/netfilter-sysctl.rst @@ -0,0 +1,17 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +Netfilter Sysfs variables +========================= + +/proc/sys/net/netfilter/* Variables: +==================================== + +nf_log_all_netns - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + By default, only init_net namespace can log packets into kernel log + with LOG target; this aims to prevent containers from flooding host + kernel log. If enabled, this target also works in other network + namespaces. This variable is only accessible from init_net. diff --git a/Documentation/networking/netfilter-sysctl.txt b/Documentation/networking/netfilter-sysctl.txt deleted file mode 100644 index 55791e50e169..000000000000 --- a/Documentation/networking/netfilter-sysctl.txt +++ /dev/null @@ -1,10 +0,0 @@ -/proc/sys/net/netfilter/* Variables: - -nf_log_all_netns - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - By default, only init_net namespace can log packets into kernel log - with LOG target; this aims to prevent containers from flooding host - kernel log. If enabled, this target also works in other network - namespaces. This variable is only accessible from init_net. -- cgit v1.2.3 From c4d5dff60f0a142937b52626653314f3d0a18420 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:06 +0200 Subject: docs: networking: convert netif-msg.txt to ReST - add SPDX header; - adjust title and chapter markups; - mark lists as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/netif-msg.rst | 95 ++++++++++++++++++++++++++++++++++ Documentation/networking/netif-msg.txt | 79 ---------------------------- 3 files changed, 96 insertions(+), 79 deletions(-) create mode 100644 Documentation/networking/netif-msg.rst delete mode 100644 Documentation/networking/netif-msg.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 1ae0cbef8c04..d98509f15363 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -84,6 +84,7 @@ Contents: netdev-features netdevices netfilter-sysctl + netif-msg .. only:: subproject and html diff --git a/Documentation/networking/netif-msg.rst b/Documentation/networking/netif-msg.rst new file mode 100644 index 000000000000..b20d265a734d --- /dev/null +++ b/Documentation/networking/netif-msg.rst @@ -0,0 +1,95 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +NETIF Msg Level +=============== + +The design of the network interface message level setting. + +History +------- + + The design of the debugging message interface was guided and + constrained by backwards compatibility previous practice. It is useful + to understand the history and evolution in order to understand current + practice and relate it to older driver source code. + + From the beginning of Linux, each network device driver has had a local + integer variable that controls the debug message level. The message + level ranged from 0 to 7, and monotonically increased in verbosity. + + The message level was not precisely defined past level 3, but were + always implemented within +-1 of the specified level. Drivers tended + to shed the more verbose level messages as they matured. + + - 0 Minimal messages, only essential information on fatal errors. + - 1 Standard messages, initialization status. No run-time messages + - 2 Special media selection messages, generally timer-driver. + - 3 Interface starts and stops, including normal status messages + - 4 Tx and Rx frame error messages, and abnormal driver operation + - 5 Tx packet queue information, interrupt events. + - 6 Status on each completed Tx packet and received Rx packets + - 7 Initial contents of Tx and Rx packets + + Initially this message level variable was uniquely named in each driver + e.g. "lance_debug", so that a kernel symbolic debugger could locate and + modify the setting. When kernel modules became common, the variables + were consistently renamed to "debug" and allowed to be set as a module + parameter. + + This approach worked well. However there is always a demand for + additional features. Over the years the following emerged as + reasonable and easily implemented enhancements + + - Using an ioctl() call to modify the level. + - Per-interface rather than per-driver message level setting. + - More selective control over the type of messages emitted. + + The netif_msg recommendation adds these features with only a minor + complexity and code size increase. + + The recommendation is the following points + + - Retaining the per-driver integer variable "debug" as a module + parameter with a default level of '1'. + + - Adding a per-interface private variable named "msg_enable". The + variable is a bit map rather than a level, and is initialized as:: + + 1 << debug + + Or more precisely:: + + debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug) + + Messages should changes from:: + + if (debug > 1) + printk(MSG_DEBUG "%s: ... + + to:: + + if (np->msg_enable & NETIF_MSG_LINK) + printk(MSG_DEBUG "%s: ... + + +The set of message levels is named + + + ========= =================== ============ + Old level Name Bit position + ========= =================== ============ + 0 NETIF_MSG_DRV 0x0001 + 1 NETIF_MSG_PROBE 0x0002 + 2 NETIF_MSG_LINK 0x0004 + 2 NETIF_MSG_TIMER 0x0004 + 3 NETIF_MSG_IFDOWN 0x0008 + 3 NETIF_MSG_IFUP 0x0008 + 4 NETIF_MSG_RX_ERR 0x0010 + 4 NETIF_MSG_TX_ERR 0x0010 + 5 NETIF_MSG_TX_QUEUED 0x0020 + 5 NETIF_MSG_INTR 0x0020 + 6 NETIF_MSG_TX_DONE 0x0040 + 6 NETIF_MSG_RX_STATUS 0x0040 + 7 NETIF_MSG_PKTDATA 0x0080 + ========= =================== ============ diff --git a/Documentation/networking/netif-msg.txt b/Documentation/networking/netif-msg.txt deleted file mode 100644 index c967ddb90d0b..000000000000 --- a/Documentation/networking/netif-msg.txt +++ /dev/null @@ -1,79 +0,0 @@ - -________________ -NETIF Msg Level - -The design of the network interface message level setting. - -History - - The design of the debugging message interface was guided and - constrained by backwards compatibility previous practice. It is useful - to understand the history and evolution in order to understand current - practice and relate it to older driver source code. - - From the beginning of Linux, each network device driver has had a local - integer variable that controls the debug message level. The message - level ranged from 0 to 7, and monotonically increased in verbosity. - - The message level was not precisely defined past level 3, but were - always implemented within +-1 of the specified level. Drivers tended - to shed the more verbose level messages as they matured. - 0 Minimal messages, only essential information on fatal errors. - 1 Standard messages, initialization status. No run-time messages - 2 Special media selection messages, generally timer-driver. - 3 Interface starts and stops, including normal status messages - 4 Tx and Rx frame error messages, and abnormal driver operation - 5 Tx packet queue information, interrupt events. - 6 Status on each completed Tx packet and received Rx packets - 7 Initial contents of Tx and Rx packets - - Initially this message level variable was uniquely named in each driver - e.g. "lance_debug", so that a kernel symbolic debugger could locate and - modify the setting. When kernel modules became common, the variables - were consistently renamed to "debug" and allowed to be set as a module - parameter. - - This approach worked well. However there is always a demand for - additional features. Over the years the following emerged as - reasonable and easily implemented enhancements - Using an ioctl() call to modify the level. - Per-interface rather than per-driver message level setting. - More selective control over the type of messages emitted. - - The netif_msg recommendation adds these features with only a minor - complexity and code size increase. - - The recommendation is the following points - Retaining the per-driver integer variable "debug" as a module - parameter with a default level of '1'. - - Adding a per-interface private variable named "msg_enable". The - variable is a bit map rather than a level, and is initialized as - 1 << debug - Or more precisely - debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug) - - Messages should changes from - if (debug > 1) - printk(MSG_DEBUG "%s: ... - to - if (np->msg_enable & NETIF_MSG_LINK) - printk(MSG_DEBUG "%s: ... - - -The set of message levels is named - Old level Name Bit position - 0 NETIF_MSG_DRV 0x0001 - 1 NETIF_MSG_PROBE 0x0002 - 2 NETIF_MSG_LINK 0x0004 - 2 NETIF_MSG_TIMER 0x0004 - 3 NETIF_MSG_IFDOWN 0x0008 - 3 NETIF_MSG_IFUP 0x0008 - 4 NETIF_MSG_RX_ERR 0x0010 - 4 NETIF_MSG_TX_ERR 0x0010 - 5 NETIF_MSG_TX_QUEUED 0x0020 - 5 NETIF_MSG_INTR 0x0020 - 6 NETIF_MSG_TX_DONE 0x0040 - 6 NETIF_MSG_RX_STATUS 0x0040 - 7 NETIF_MSG_PKTDATA 0x0080 - -- cgit v1.2.3 From 13df433f8c1333cd0f60e5e1878380a37576118a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:07 +0200 Subject: docs: networking: convert nf_conntrack-sysctl.txt to ReST - add SPDX header; - add a document title; - mark lists as such; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/nf_conntrack-sysctl.rst | 179 +++++++++++++++++++++++ Documentation/networking/nf_conntrack-sysctl.txt | 172 ---------------------- 3 files changed, 180 insertions(+), 172 deletions(-) create mode 100644 Documentation/networking/nf_conntrack-sysctl.rst delete mode 100644 Documentation/networking/nf_conntrack-sysctl.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index d98509f15363..e5128bb7e7df 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -85,6 +85,7 @@ Contents: netdevices netfilter-sysctl netif-msg + nf_conntrack-sysctl .. only:: subproject and html diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst new file mode 100644 index 000000000000..11a9b76786cb --- /dev/null +++ b/Documentation/networking/nf_conntrack-sysctl.rst @@ -0,0 +1,179 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Netfilter Conntrack Sysfs variables +=================================== + +/proc/sys/net/netfilter/nf_conntrack_* Variables: +================================================= + +nf_conntrack_acct - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + Enable connection tracking flow accounting. 64-bit byte and packet + counters per flow are added. + +nf_conntrack_buckets - INTEGER + Size of hash table. If not specified as parameter during module + loading, the default size is calculated by dividing total memory + by 16384 to determine the number of buckets but the hash table will + never have fewer than 32 and limited to 16384 buckets. For systems + with more than 4GB of memory it will be 65536 buckets. + This sysctl is only writeable in the initial net namespace. + +nf_conntrack_checksum - BOOLEAN + - 0 - disabled + - not 0 - enabled (default) + + Verify checksum of incoming packets. Packets with bad checksums are + in INVALID state. If this is enabled, such packets will not be + considered for connection tracking. + +nf_conntrack_count - INTEGER (read-only) + Number of currently allocated flow entries. + +nf_conntrack_events - BOOLEAN + - 0 - disabled + - not 0 - enabled (default) + + If this option is enabled, the connection tracking code will + provide userspace with connection tracking events via ctnetlink. + +nf_conntrack_expect_max - INTEGER + Maximum size of expectation table. Default value is + nf_conntrack_buckets / 256. Minimum is 1. + +nf_conntrack_frag6_high_thresh - INTEGER + default 262144 + + Maximum memory used to reassemble IPv6 fragments. When + nf_conntrack_frag6_high_thresh bytes of memory is allocated for this + purpose, the fragment handler will toss packets until + nf_conntrack_frag6_low_thresh is reached. + +nf_conntrack_frag6_low_thresh - INTEGER + default 196608 + + See nf_conntrack_frag6_low_thresh + +nf_conntrack_frag6_timeout - INTEGER (seconds) + default 60 + + Time to keep an IPv6 fragment in memory. + +nf_conntrack_generic_timeout - INTEGER (seconds) + default 600 + + Default for generic timeout. This refers to layer 4 unknown/unsupported + protocols. + +nf_conntrack_helper - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + Enable automatic conntrack helper assignment. + If disabled it is required to set up iptables rules to assign + helpers to connections. See the CT target description in the + iptables-extensions(8) man page for further information. + +nf_conntrack_icmp_timeout - INTEGER (seconds) + default 30 + + Default for ICMP timeout. + +nf_conntrack_icmpv6_timeout - INTEGER (seconds) + default 30 + + Default for ICMP6 timeout. + +nf_conntrack_log_invalid - INTEGER + - 0 - disable (default) + - 1 - log ICMP packets + - 6 - log TCP packets + - 17 - log UDP packets + - 33 - log DCCP packets + - 41 - log ICMPv6 packets + - 136 - log UDPLITE packets + - 255 - log packets of any protocol + + Log invalid packets of a type specified by value. + +nf_conntrack_max - INTEGER + Size of connection tracking table. Default value is + nf_conntrack_buckets value * 4. + +nf_conntrack_tcp_be_liberal - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + Be conservative in what you do, be liberal in what you accept from others. + If it's non-zero, we mark only out of window RST segments as INVALID. + +nf_conntrack_tcp_loose - BOOLEAN + - 0 - disabled + - not 0 - enabled (default) + + If it is set to zero, we disable picking up already established + connections. + +nf_conntrack_tcp_max_retrans - INTEGER + default 3 + + Maximum number of packets that can be retransmitted without + received an (acceptable) ACK from the destination. If this number + is reached, a shorter timer will be started. + +nf_conntrack_tcp_timeout_close - INTEGER (seconds) + default 10 + +nf_conntrack_tcp_timeout_close_wait - INTEGER (seconds) + default 60 + +nf_conntrack_tcp_timeout_established - INTEGER (seconds) + default 432000 (5 days) + +nf_conntrack_tcp_timeout_fin_wait - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_last_ack - INTEGER (seconds) + default 30 + +nf_conntrack_tcp_timeout_max_retrans - INTEGER (seconds) + default 300 + +nf_conntrack_tcp_timeout_syn_recv - INTEGER (seconds) + default 60 + +nf_conntrack_tcp_timeout_syn_sent - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_time_wait - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_unacknowledged - INTEGER (seconds) + default 300 + +nf_conntrack_timestamp - BOOLEAN + - 0 - disabled (default) + - not 0 - enabled + + Enable connection tracking flow timestamping. + +nf_conntrack_udp_timeout - INTEGER (seconds) + default 30 + +nf_conntrack_udp_timeout_stream - INTEGER (seconds) + default 120 + + This extended timeout will be used in case there is an UDP stream + detected. + +nf_conntrack_gre_timeout - INTEGER (seconds) + default 30 + +nf_conntrack_gre_timeout_stream - INTEGER (seconds) + default 180 + + This extended timeout will be used in case there is an GRE stream + detected. diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt deleted file mode 100644 index f75c2ce6e136..000000000000 --- a/Documentation/networking/nf_conntrack-sysctl.txt +++ /dev/null @@ -1,172 +0,0 @@ -/proc/sys/net/netfilter/nf_conntrack_* Variables: - -nf_conntrack_acct - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - Enable connection tracking flow accounting. 64-bit byte and packet - counters per flow are added. - -nf_conntrack_buckets - INTEGER - Size of hash table. If not specified as parameter during module - loading, the default size is calculated by dividing total memory - by 16384 to determine the number of buckets but the hash table will - never have fewer than 32 and limited to 16384 buckets. For systems - with more than 4GB of memory it will be 65536 buckets. - This sysctl is only writeable in the initial net namespace. - -nf_conntrack_checksum - BOOLEAN - 0 - disabled - not 0 - enabled (default) - - Verify checksum of incoming packets. Packets with bad checksums are - in INVALID state. If this is enabled, such packets will not be - considered for connection tracking. - -nf_conntrack_count - INTEGER (read-only) - Number of currently allocated flow entries. - -nf_conntrack_events - BOOLEAN - 0 - disabled - not 0 - enabled (default) - - If this option is enabled, the connection tracking code will - provide userspace with connection tracking events via ctnetlink. - -nf_conntrack_expect_max - INTEGER - Maximum size of expectation table. Default value is - nf_conntrack_buckets / 256. Minimum is 1. - -nf_conntrack_frag6_high_thresh - INTEGER - default 262144 - - Maximum memory used to reassemble IPv6 fragments. When - nf_conntrack_frag6_high_thresh bytes of memory is allocated for this - purpose, the fragment handler will toss packets until - nf_conntrack_frag6_low_thresh is reached. - -nf_conntrack_frag6_low_thresh - INTEGER - default 196608 - - See nf_conntrack_frag6_low_thresh - -nf_conntrack_frag6_timeout - INTEGER (seconds) - default 60 - - Time to keep an IPv6 fragment in memory. - -nf_conntrack_generic_timeout - INTEGER (seconds) - default 600 - - Default for generic timeout. This refers to layer 4 unknown/unsupported - protocols. - -nf_conntrack_helper - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - Enable automatic conntrack helper assignment. - If disabled it is required to set up iptables rules to assign - helpers to connections. See the CT target description in the - iptables-extensions(8) man page for further information. - -nf_conntrack_icmp_timeout - INTEGER (seconds) - default 30 - - Default for ICMP timeout. - -nf_conntrack_icmpv6_timeout - INTEGER (seconds) - default 30 - - Default for ICMP6 timeout. - -nf_conntrack_log_invalid - INTEGER - 0 - disable (default) - 1 - log ICMP packets - 6 - log TCP packets - 17 - log UDP packets - 33 - log DCCP packets - 41 - log ICMPv6 packets - 136 - log UDPLITE packets - 255 - log packets of any protocol - - Log invalid packets of a type specified by value. - -nf_conntrack_max - INTEGER - Size of connection tracking table. Default value is - nf_conntrack_buckets value * 4. - -nf_conntrack_tcp_be_liberal - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - Be conservative in what you do, be liberal in what you accept from others. - If it's non-zero, we mark only out of window RST segments as INVALID. - -nf_conntrack_tcp_loose - BOOLEAN - 0 - disabled - not 0 - enabled (default) - - If it is set to zero, we disable picking up already established - connections. - -nf_conntrack_tcp_max_retrans - INTEGER - default 3 - - Maximum number of packets that can be retransmitted without - received an (acceptable) ACK from the destination. If this number - is reached, a shorter timer will be started. - -nf_conntrack_tcp_timeout_close - INTEGER (seconds) - default 10 - -nf_conntrack_tcp_timeout_close_wait - INTEGER (seconds) - default 60 - -nf_conntrack_tcp_timeout_established - INTEGER (seconds) - default 432000 (5 days) - -nf_conntrack_tcp_timeout_fin_wait - INTEGER (seconds) - default 120 - -nf_conntrack_tcp_timeout_last_ack - INTEGER (seconds) - default 30 - -nf_conntrack_tcp_timeout_max_retrans - INTEGER (seconds) - default 300 - -nf_conntrack_tcp_timeout_syn_recv - INTEGER (seconds) - default 60 - -nf_conntrack_tcp_timeout_syn_sent - INTEGER (seconds) - default 120 - -nf_conntrack_tcp_timeout_time_wait - INTEGER (seconds) - default 120 - -nf_conntrack_tcp_timeout_unacknowledged - INTEGER (seconds) - default 300 - -nf_conntrack_timestamp - BOOLEAN - 0 - disabled (default) - not 0 - enabled - - Enable connection tracking flow timestamping. - -nf_conntrack_udp_timeout - INTEGER (seconds) - default 30 - -nf_conntrack_udp_timeout_stream - INTEGER (seconds) - default 120 - - This extended timeout will be used in case there is an UDP stream - detected. - -nf_conntrack_gre_timeout - INTEGER (seconds) - default 30 - -nf_conntrack_gre_timeout_stream - INTEGER (seconds) - default 180 - - This extended timeout will be used in case there is an GRE stream - detected. -- cgit v1.2.3 From aa3764276a4bc1a927ab8feaad5a255234763d9d Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:08 +0200 Subject: docs: networking: convert nf_flowtable.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - add notes markups; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/nf_flowtable.rst | 117 ++++++++++++++++++++++++++++++ Documentation/networking/nf_flowtable.txt | 112 ---------------------------- 3 files changed, 118 insertions(+), 112 deletions(-) create mode 100644 Documentation/networking/nf_flowtable.rst delete mode 100644 Documentation/networking/nf_flowtable.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e5128bb7e7df..c4e8a43741be 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -86,6 +86,7 @@ Contents: netfilter-sysctl netif-msg nf_conntrack-sysctl + nf_flowtable .. only:: subproject and html diff --git a/Documentation/networking/nf_flowtable.rst b/Documentation/networking/nf_flowtable.rst new file mode 100644 index 000000000000..b6e1fa141aae --- /dev/null +++ b/Documentation/networking/nf_flowtable.rst @@ -0,0 +1,117 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================== +Netfilter's flowtable infrastructure +==================================== + +This documentation describes the software flowtable infrastructure available in +Netfilter since Linux kernel 4.16. + +Overview +-------- + +Initial packets follow the classic forwarding path, once the flow enters the +established state according to the conntrack semantics (ie. we have seen traffic +in both directions), then you can decide to offload the flow to the flowtable +from the forward chain via the 'flow offload' action available in nftables. + +Packets that find an entry in the flowtable (ie. flowtable hit) are sent to the +output netdevice via neigh_xmit(), hence, they bypass the classic forwarding +path (the visible effect is that you do not see these packets from any of the +netfilter hooks coming after the ingress). In case of flowtable miss, the packet +follows the classic forward path. + +The flowtable uses a resizable hashtable, lookups are based on the following +7-tuple selectors: source, destination, layer 3 and layer 4 protocols, source +and destination ports and the input interface (useful in case there are several +conntrack zones in place). + +Flowtables are populated via the 'flow offload' nftables action, so the user can +selectively specify what flows are placed into the flow table. Hence, packets +follow the classic forwarding path unless the user explicitly instruct packets +to use this new alternative forwarding path via nftables policy. + +This is represented in Fig.1, which describes the classic forwarding path +including the Netfilter hooks and the flowtable fastpath bypass. + +:: + + userspace process + ^ | + | | + _____|____ ____\/___ + / \ / \ + | input | | output | + \__________/ \_________/ + ^ | + | | + _________ __________ --------- _____\/_____ + / \ / \ |Routing | / \ + --> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit + \_________/ \__________/ ---------- \____________/ ^ + | ^ | ^ | + flowtable | ____\/___ | | + | | / \ | | + __\/___ | | forward |------------ | + |-----| | \_________/ | + |-----| | 'flow offload' rule | + |-----| | adds entry to | + |_____| | flowtable | + | | | + / \ | | + /hit\_no_| | + \ ? / | + \ / | + |__yes_________________fastpath bypass ____________________________| + + Fig.1 Netfilter hooks and flowtable interactions + +The flowtable entry also stores the NAT configuration, so all packets are +mangled according to the NAT policy that matches the initial packets that went +through the classic forwarding path. The TTL is decremented before calling +neigh_xmit(). Fragmented traffic is passed up to follow the classic forwarding +path given that the transport selectors are missing, therefore flowtable lookup +is not possible. + +Example configuration +--------------------- + +Enabling the flowtable bypass is relatively easy, you only need to create a +flowtable and add one rule to your forward chain:: + + table inet x { + flowtable f { + hook ingress priority 0; devices = { eth0, eth1 }; + } + chain y { + type filter hook forward priority 0; policy accept; + ip protocol tcp flow offload @f + counter packets 0 bytes 0 + } + } + +This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1 +netdevices. You can create as many flowtables as you want in case you need to +perform resource partitioning. The flowtable priority defines the order in which +hooks are run in the pipeline, this is convenient in case you already have a +nftables ingress chain (make sure the flowtable priority is smaller than the +nftables ingress chain hence the flowtable runs before in the pipeline). + +The 'flow offload' action from the forward chain 'y' adds an entry to the +flowtable for the TCP syn-ack packet coming in the reply direction. Once the +flow is offloaded, you will observe that the counter rule in the example above +does not get updated for the packets that are being forwarded through the +forwarding bypass. + +More reading +------------ + +This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki +also made a very complete and comprehensive summary called "A state of network +acceleration" that describes how things were before this infrastructure was +mailined [3]_ and it also makes a rough summary of this work [4]_. + +.. [1] https://lwn.net/Articles/738214/ +.. [2] https://lwn.net/Articles/742164/ +.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html +.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html diff --git a/Documentation/networking/nf_flowtable.txt b/Documentation/networking/nf_flowtable.txt deleted file mode 100644 index 0bf32d1121be..000000000000 --- a/Documentation/networking/nf_flowtable.txt +++ /dev/null @@ -1,112 +0,0 @@ -Netfilter's flowtable infrastructure -==================================== - -This documentation describes the software flowtable infrastructure available in -Netfilter since Linux kernel 4.16. - -Overview --------- - -Initial packets follow the classic forwarding path, once the flow enters the -established state according to the conntrack semantics (ie. we have seen traffic -in both directions), then you can decide to offload the flow to the flowtable -from the forward chain via the 'flow offload' action available in nftables. - -Packets that find an entry in the flowtable (ie. flowtable hit) are sent to the -output netdevice via neigh_xmit(), hence, they bypass the classic forwarding -path (the visible effect is that you do not see these packets from any of the -netfilter hooks coming after the ingress). In case of flowtable miss, the packet -follows the classic forward path. - -The flowtable uses a resizable hashtable, lookups are based on the following -7-tuple selectors: source, destination, layer 3 and layer 4 protocols, source -and destination ports and the input interface (useful in case there are several -conntrack zones in place). - -Flowtables are populated via the 'flow offload' nftables action, so the user can -selectively specify what flows are placed into the flow table. Hence, packets -follow the classic forwarding path unless the user explicitly instruct packets -to use this new alternative forwarding path via nftables policy. - -This is represented in Fig.1, which describes the classic forwarding path -including the Netfilter hooks and the flowtable fastpath bypass. - - userspace process - ^ | - | | - _____|____ ____\/___ - / \ / \ - | input | | output | - \__________/ \_________/ - ^ | - | | - _________ __________ --------- _____\/_____ - / \ / \ |Routing | / \ - --> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit - \_________/ \__________/ ---------- \____________/ ^ - | ^ | ^ | - flowtable | ____\/___ | | - | | / \ | | - __\/___ | | forward |------------ | - |-----| | \_________/ | - |-----| | 'flow offload' rule | - |-----| | adds entry to | - |_____| | flowtable | - | | | - / \ | | - /hit\_no_| | - \ ? / | - \ / | - |__yes_________________fastpath bypass ____________________________| - - Fig.1 Netfilter hooks and flowtable interactions - -The flowtable entry also stores the NAT configuration, so all packets are -mangled according to the NAT policy that matches the initial packets that went -through the classic forwarding path. The TTL is decremented before calling -neigh_xmit(). Fragmented traffic is passed up to follow the classic forwarding -path given that the transport selectors are missing, therefore flowtable lookup -is not possible. - -Example configuration ---------------------- - -Enabling the flowtable bypass is relatively easy, you only need to create a -flowtable and add one rule to your forward chain. - - table inet x { - flowtable f { - hook ingress priority 0; devices = { eth0, eth1 }; - } - chain y { - type filter hook forward priority 0; policy accept; - ip protocol tcp flow offload @f - counter packets 0 bytes 0 - } - } - -This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1 -netdevices. You can create as many flowtables as you want in case you need to -perform resource partitioning. The flowtable priority defines the order in which -hooks are run in the pipeline, this is convenient in case you already have a -nftables ingress chain (make sure the flowtable priority is smaller than the -nftables ingress chain hence the flowtable runs before in the pipeline). - -The 'flow offload' action from the forward chain 'y' adds an entry to the -flowtable for the TCP syn-ack packet coming in the reply direction. Once the -flow is offloaded, you will observe that the counter rule in the example above -does not get updated for the packets that are being forwarded through the -forwarding bypass. - -More reading ------------- - -This documentation is based on the LWN.net articles [1][2]. Rafal Milecki also -made a very complete and comprehensive summary called "A state of network -acceleration" that describes how things were before this infrastructure was -mailined [3] and it also makes a rough summary of this work [4]. - -[1] https://lwn.net/Articles/738214/ -[2] https://lwn.net/Articles/742164/ -[3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html -[4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html -- cgit v1.2.3 From 63893472d753e97204331e568a5e70290054cb38 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:09 +0200 Subject: docs: networking: convert openvswitch.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/openvswitch.rst | 251 +++++++++++++++++++++++++++++++ Documentation/networking/openvswitch.txt | 248 ------------------------------ 3 files changed, 252 insertions(+), 248 deletions(-) create mode 100644 Documentation/networking/openvswitch.rst delete mode 100644 Documentation/networking/openvswitch.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index c4e8a43741be..b7f558480aca 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -87,6 +87,7 @@ Contents: netif-msg nf_conntrack-sysctl nf_flowtable + openvswitch .. only:: subproject and html diff --git a/Documentation/networking/openvswitch.rst b/Documentation/networking/openvswitch.rst new file mode 100644 index 000000000000..1a8353dbf1b6 --- /dev/null +++ b/Documentation/networking/openvswitch.rst @@ -0,0 +1,251 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================= +Open vSwitch datapath developer documentation +============================================= + +The Open vSwitch kernel module allows flexible userspace control over +flow-level packet processing on selected network devices. It can be +used to implement a plain Ethernet switch, network device bonding, +VLAN processing, network access control, flow-based network control, +and so on. + +The kernel module implements multiple "datapaths" (analogous to +bridges), each of which can have multiple "vports" (analogous to ports +within a bridge). Each datapath also has associated with it a "flow +table" that userspace populates with "flows" that map from keys based +on packet headers and metadata to sets of actions. The most common +action forwards the packet to another vport; other actions are also +implemented. + +When a packet arrives on a vport, the kernel module processes it by +extracting its flow key and looking it up in the flow table. If there +is a matching flow, it executes the associated actions. If there is +no match, it queues the packet to userspace for processing (as part of +its processing, userspace will likely set up a flow to handle further +packets of the same type entirely in-kernel). + + +Flow key compatibility +---------------------- + +Network protocols evolve over time. New protocols become important +and existing protocols lose their prominence. For the Open vSwitch +kernel module to remain relevant, it must be possible for newer +versions to parse additional protocols as part of the flow key. It +might even be desirable, someday, to drop support for parsing +protocols that have become obsolete. Therefore, the Netlink interface +to Open vSwitch is designed to allow carefully written userspace +applications to work with any version of the flow key, past or future. + +To support this forward and backward compatibility, whenever the +kernel module passes a packet to userspace, it also passes along the +flow key that it parsed from the packet. Userspace then extracts its +own notion of a flow key from the packet and compares it against the +kernel-provided version: + + - If userspace's notion of the flow key for the packet matches the + kernel's, then nothing special is necessary. + + - If the kernel's flow key includes more fields than the userspace + version of the flow key, for example if the kernel decoded IPv6 + headers but userspace stopped at the Ethernet type (because it + does not understand IPv6), then again nothing special is + necessary. Userspace can still set up a flow in the usual way, + as long as it uses the kernel-provided flow key to do it. + + - If the userspace flow key includes more fields than the + kernel's, for example if userspace decoded an IPv6 header but + the kernel stopped at the Ethernet type, then userspace can + forward the packet manually, without setting up a flow in the + kernel. This case is bad for performance because every packet + that the kernel considers part of the flow must go to userspace, + but the forwarding behavior is correct. (If userspace can + determine that the values of the extra fields would not affect + forwarding behavior, then it could set up a flow anyway.) + +How flow keys evolve over time is important to making this work, so +the following sections go into detail. + + +Flow key format +--------------- + +A flow key is passed over a Netlink socket as a sequence of Netlink +attributes. Some attributes represent packet metadata, defined as any +information about a packet that cannot be extracted from the packet +itself, e.g. the vport on which the packet was received. Most +attributes, however, are extracted from headers within the packet, +e.g. source and destination addresses from Ethernet, IP, or TCP +headers. + +The header file defines the exact format of the +flow key attributes. For informal explanatory purposes here, we write +them as comma-separated strings, with parentheses indicating arguments +and nesting. For example, the following could represent a flow key +corresponding to a TCP packet that arrived on vport 1:: + + in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), + eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0, + frag=no), tcp(src=49163, dst=80) + +Often we ellipsize arguments not important to the discussion, e.g.:: + + in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...) + + +Wildcarded flow key format +-------------------------- + +A wildcarded flow is described with two sequences of Netlink attributes +passed over the Netlink socket. A flow key, exactly as described above, and an +optional corresponding flow mask. + +A wildcarded flow can represent a group of exact match flows. Each '1' bit +in the mask specifies a exact match with the corresponding bit in the flow key. +A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit +of a incoming packet. Using wildcarded flow can improve the flow set up rate +by reduce the number of new flows need to be processed by the user space program. + +Support for the mask Netlink attribute is optional for both the kernel and user +space program. The kernel can ignore the mask attribute, installing an exact +match flow, or reduce the number of don't care bits in the kernel to less than +what was specified by the user space program. In this case, variations in bits +that the kernel does not implement will simply result in additional flow setups. +The kernel module will also work with user space programs that neither support +nor supply flow mask attributes. + +Since the kernel may ignore or modify wildcard bits, it can be difficult for +the userspace program to know exactly what matches are installed. There are +two possible approaches: reactively install flows as they miss the kernel +flow table (and therefore not attempt to determine wildcard changes at all) +or use the kernel's response messages to determine the installed wildcards. + +When interacting with userspace, the kernel should maintain the match portion +of the key exactly as originally installed. This will provides a handle to +identify the flow for all future operations. However, when reporting the +mask of an installed flow, the mask should include any restrictions imposed +by the kernel. + +The behavior when using overlapping wildcarded flows is undefined. It is the +responsibility of the user space program to ensure that any incoming packet +can match at most one flow, wildcarded or not. The current implementation +performs best-effort detection of overlapping wildcarded flows and may reject +some but not all of them. However, this behavior may change in future versions. + + +Unique flow identifiers +----------------------- + +An alternative to using the original match portion of a key as the handle for +flow identification is a unique flow identifier, or "UFID". UFIDs are optional +for both the kernel and user space program. + +User space programs that support UFID are expected to provide it during flow +setup in addition to the flow, then refer to the flow using the UFID for all +future operations. The kernel is not required to index flows by the original +flow key if a UFID is specified. + + +Basic rule for evolving flow keys +--------------------------------- + +Some care is needed to really maintain forward and backward +compatibility for applications that follow the rules listed under +"Flow key compatibility" above. + +The basic rule is obvious:: + + ================================================================== + New network protocol support must only supplement existing flow + key attributes. It must not change the meaning of already defined + flow key attributes. + ================================================================== + +This rule does have less-obvious consequences so it is worth working +through a few examples. Suppose, for example, that the kernel module +did not already implement VLAN parsing. Instead, it just interpreted +the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the +packet. The flow key for any packet with an 802.1Q header would look +essentially like this, ignoring metadata:: + + eth(...), eth_type(0x8100) + +Naively, to add VLAN support, it makes sense to add a new "vlan" flow +key attribute to contain the VLAN tag, then continue to decode the +encapsulated headers beyond the VLAN tag using the existing field +definitions. With this change, a TCP packet in VLAN 10 would have a +flow key much like this:: + + eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...) + +But this change would negatively affect a userspace application that +has not been updated to understand the new "vlan" flow key attribute. +The application could, following the flow compatibility rules above, +ignore the "vlan" attribute that it does not understand and therefore +assume that the flow contained IP packets. This is a bad assumption +(the flow only contains IP packets if one parses and skips over the +802.1Q header) and it could cause the application's behavior to change +across kernel versions even though it follows the compatibility rules. + +The solution is to use a set of nested attributes. This is, for +example, why 802.1Q support uses nested attributes. A TCP packet in +VLAN 10 is actually expressed as:: + + eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800), + ip(proto=6, ...), tcp(...))) + +Notice how the "eth_type", "ip", and "tcp" flow key attributes are +nested inside the "encap" attribute. Thus, an application that does +not understand the "vlan" key will not see either of those attributes +and therefore will not misinterpret them. (Also, the outer eth_type +is still 0x8100, not changed to 0x0800.) + +Handling malformed packets +-------------------------- + +Don't drop packets in the kernel for malformed protocol headers, bad +checksums, etc. This would prevent userspace from implementing a +simple Ethernet switch that forwards every packet. + +Instead, in such a case, include an attribute with "empty" content. +It doesn't matter if the empty content could be valid protocol values, +as long as those values are rarely seen in practice, because userspace +can always forward all packets with those values to userspace and +handle them individually. + +For example, consider a packet that contains an IP header that +indicates protocol 6 for TCP, but which is truncated just after the IP +header, so that the TCP header is missing. The flow key for this +packet would include a tcp attribute with all-zero src and dst, like +this:: + + eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) + +As another example, consider a packet with an Ethernet type of 0x8100, +indicating that a VLAN TCI should follow, but which is truncated just +after the Ethernet type. The flow key for this packet would include +an all-zero-bits vlan and an empty encap attribute, like this:: + + eth(...), eth_type(0x8100), vlan(0), encap() + +Unlike a TCP packet with source and destination ports 0, an +all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka +VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan +attribute expressly to allow this situation to be distinguished. +Thus, the flow key in this second example unambiguously indicates a +missing or malformed VLAN TCI. + +Other rules +----------- + +The other rules for flow keys are much less subtle: + + - Duplicate attributes are not allowed at a given nesting level. + + - Ordering of attributes is not significant. + + - When the kernel sends a given flow key to userspace, it always + composes it the same way. This allows userspace to hash and + compare entire flow keys that it may not be able to fully + interpret. diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.txt deleted file mode 100644 index b3b9ac61d29d..000000000000 --- a/Documentation/networking/openvswitch.txt +++ /dev/null @@ -1,248 +0,0 @@ -Open vSwitch datapath developer documentation -============================================= - -The Open vSwitch kernel module allows flexible userspace control over -flow-level packet processing on selected network devices. It can be -used to implement a plain Ethernet switch, network device bonding, -VLAN processing, network access control, flow-based network control, -and so on. - -The kernel module implements multiple "datapaths" (analogous to -bridges), each of which can have multiple "vports" (analogous to ports -within a bridge). Each datapath also has associated with it a "flow -table" that userspace populates with "flows" that map from keys based -on packet headers and metadata to sets of actions. The most common -action forwards the packet to another vport; other actions are also -implemented. - -When a packet arrives on a vport, the kernel module processes it by -extracting its flow key and looking it up in the flow table. If there -is a matching flow, it executes the associated actions. If there is -no match, it queues the packet to userspace for processing (as part of -its processing, userspace will likely set up a flow to handle further -packets of the same type entirely in-kernel). - - -Flow key compatibility ----------------------- - -Network protocols evolve over time. New protocols become important -and existing protocols lose their prominence. For the Open vSwitch -kernel module to remain relevant, it must be possible for newer -versions to parse additional protocols as part of the flow key. It -might even be desirable, someday, to drop support for parsing -protocols that have become obsolete. Therefore, the Netlink interface -to Open vSwitch is designed to allow carefully written userspace -applications to work with any version of the flow key, past or future. - -To support this forward and backward compatibility, whenever the -kernel module passes a packet to userspace, it also passes along the -flow key that it parsed from the packet. Userspace then extracts its -own notion of a flow key from the packet and compares it against the -kernel-provided version: - - - If userspace's notion of the flow key for the packet matches the - kernel's, then nothing special is necessary. - - - If the kernel's flow key includes more fields than the userspace - version of the flow key, for example if the kernel decoded IPv6 - headers but userspace stopped at the Ethernet type (because it - does not understand IPv6), then again nothing special is - necessary. Userspace can still set up a flow in the usual way, - as long as it uses the kernel-provided flow key to do it. - - - If the userspace flow key includes more fields than the - kernel's, for example if userspace decoded an IPv6 header but - the kernel stopped at the Ethernet type, then userspace can - forward the packet manually, without setting up a flow in the - kernel. This case is bad for performance because every packet - that the kernel considers part of the flow must go to userspace, - but the forwarding behavior is correct. (If userspace can - determine that the values of the extra fields would not affect - forwarding behavior, then it could set up a flow anyway.) - -How flow keys evolve over time is important to making this work, so -the following sections go into detail. - - -Flow key format ---------------- - -A flow key is passed over a Netlink socket as a sequence of Netlink -attributes. Some attributes represent packet metadata, defined as any -information about a packet that cannot be extracted from the packet -itself, e.g. the vport on which the packet was received. Most -attributes, however, are extracted from headers within the packet, -e.g. source and destination addresses from Ethernet, IP, or TCP -headers. - -The header file defines the exact format of the -flow key attributes. For informal explanatory purposes here, we write -them as comma-separated strings, with parentheses indicating arguments -and nesting. For example, the following could represent a flow key -corresponding to a TCP packet that arrived on vport 1: - - in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), - eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0, - frag=no), tcp(src=49163, dst=80) - -Often we ellipsize arguments not important to the discussion, e.g.: - - in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...) - - -Wildcarded flow key format --------------------------- - -A wildcarded flow is described with two sequences of Netlink attributes -passed over the Netlink socket. A flow key, exactly as described above, and an -optional corresponding flow mask. - -A wildcarded flow can represent a group of exact match flows. Each '1' bit -in the mask specifies a exact match with the corresponding bit in the flow key. -A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit -of a incoming packet. Using wildcarded flow can improve the flow set up rate -by reduce the number of new flows need to be processed by the user space program. - -Support for the mask Netlink attribute is optional for both the kernel and user -space program. The kernel can ignore the mask attribute, installing an exact -match flow, or reduce the number of don't care bits in the kernel to less than -what was specified by the user space program. In this case, variations in bits -that the kernel does not implement will simply result in additional flow setups. -The kernel module will also work with user space programs that neither support -nor supply flow mask attributes. - -Since the kernel may ignore or modify wildcard bits, it can be difficult for -the userspace program to know exactly what matches are installed. There are -two possible approaches: reactively install flows as they miss the kernel -flow table (and therefore not attempt to determine wildcard changes at all) -or use the kernel's response messages to determine the installed wildcards. - -When interacting with userspace, the kernel should maintain the match portion -of the key exactly as originally installed. This will provides a handle to -identify the flow for all future operations. However, when reporting the -mask of an installed flow, the mask should include any restrictions imposed -by the kernel. - -The behavior when using overlapping wildcarded flows is undefined. It is the -responsibility of the user space program to ensure that any incoming packet -can match at most one flow, wildcarded or not. The current implementation -performs best-effort detection of overlapping wildcarded flows and may reject -some but not all of them. However, this behavior may change in future versions. - - -Unique flow identifiers ------------------------ - -An alternative to using the original match portion of a key as the handle for -flow identification is a unique flow identifier, or "UFID". UFIDs are optional -for both the kernel and user space program. - -User space programs that support UFID are expected to provide it during flow -setup in addition to the flow, then refer to the flow using the UFID for all -future operations. The kernel is not required to index flows by the original -flow key if a UFID is specified. - - -Basic rule for evolving flow keys ---------------------------------- - -Some care is needed to really maintain forward and backward -compatibility for applications that follow the rules listed under -"Flow key compatibility" above. - -The basic rule is obvious: - - ------------------------------------------------------------------ - New network protocol support must only supplement existing flow - key attributes. It must not change the meaning of already defined - flow key attributes. - ------------------------------------------------------------------ - -This rule does have less-obvious consequences so it is worth working -through a few examples. Suppose, for example, that the kernel module -did not already implement VLAN parsing. Instead, it just interpreted -the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the -packet. The flow key for any packet with an 802.1Q header would look -essentially like this, ignoring metadata: - - eth(...), eth_type(0x8100) - -Naively, to add VLAN support, it makes sense to add a new "vlan" flow -key attribute to contain the VLAN tag, then continue to decode the -encapsulated headers beyond the VLAN tag using the existing field -definitions. With this change, a TCP packet in VLAN 10 would have a -flow key much like this: - - eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...) - -But this change would negatively affect a userspace application that -has not been updated to understand the new "vlan" flow key attribute. -The application could, following the flow compatibility rules above, -ignore the "vlan" attribute that it does not understand and therefore -assume that the flow contained IP packets. This is a bad assumption -(the flow only contains IP packets if one parses and skips over the -802.1Q header) and it could cause the application's behavior to change -across kernel versions even though it follows the compatibility rules. - -The solution is to use a set of nested attributes. This is, for -example, why 802.1Q support uses nested attributes. A TCP packet in -VLAN 10 is actually expressed as: - - eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800), - ip(proto=6, ...), tcp(...))) - -Notice how the "eth_type", "ip", and "tcp" flow key attributes are -nested inside the "encap" attribute. Thus, an application that does -not understand the "vlan" key will not see either of those attributes -and therefore will not misinterpret them. (Also, the outer eth_type -is still 0x8100, not changed to 0x0800.) - -Handling malformed packets --------------------------- - -Don't drop packets in the kernel for malformed protocol headers, bad -checksums, etc. This would prevent userspace from implementing a -simple Ethernet switch that forwards every packet. - -Instead, in such a case, include an attribute with "empty" content. -It doesn't matter if the empty content could be valid protocol values, -as long as those values are rarely seen in practice, because userspace -can always forward all packets with those values to userspace and -handle them individually. - -For example, consider a packet that contains an IP header that -indicates protocol 6 for TCP, but which is truncated just after the IP -header, so that the TCP header is missing. The flow key for this -packet would include a tcp attribute with all-zero src and dst, like -this: - - eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) - -As another example, consider a packet with an Ethernet type of 0x8100, -indicating that a VLAN TCI should follow, but which is truncated just -after the Ethernet type. The flow key for this packet would include -an all-zero-bits vlan and an empty encap attribute, like this: - - eth(...), eth_type(0x8100), vlan(0), encap() - -Unlike a TCP packet with source and destination ports 0, an -all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka -VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan -attribute expressly to allow this situation to be distinguished. -Thus, the flow key in this second example unambiguously indicates a -missing or malformed VLAN TCI. - -Other rules ------------ - -The other rules for flow keys are much less subtle: - - - Duplicate attributes are not allowed at a given nesting level. - - - Ordering of attributes is not significant. - - - When the kernel sends a given flow key to userspace, it always - composes it the same way. This allows userspace to hash and - compare entire flow keys that it may not be able to fully - interpret. -- cgit v1.2.3 From f5c39ef3299f4efaab4d1bb410406de5909c1687 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:10 +0200 Subject: docs: networking: convert operstates.txt to ReST - add SPDX header; - add a document title; - adjust chapters, adding proper markups; - mark lists as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/operstates.rst | 185 ++++++++++++++++++++++++++++++++ Documentation/networking/operstates.txt | 164 ---------------------------- 3 files changed, 186 insertions(+), 164 deletions(-) create mode 100644 Documentation/networking/operstates.rst delete mode 100644 Documentation/networking/operstates.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b7f558480aca..028a36821b9a 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -88,6 +88,7 @@ Contents: nf_conntrack-sysctl nf_flowtable openvswitch + operstates .. only:: subproject and html diff --git a/Documentation/networking/operstates.rst b/Documentation/networking/operstates.rst new file mode 100644 index 000000000000..9c918f7cb0e8 --- /dev/null +++ b/Documentation/networking/operstates.rst @@ -0,0 +1,185 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +Operational States +================== + + +1. Introduction +=============== + +Linux distinguishes between administrative and operational state of an +interface. Administrative state is the result of "ip link set dev + up or down" and reflects whether the administrator wants to use +the device for traffic. + +However, an interface is not usable just because the admin enabled it +- ethernet requires to be plugged into the switch and, depending on +a site's networking policy and configuration, an 802.1X authentication +to be performed before user data can be transferred. Operational state +shows the ability of an interface to transmit this user data. + +Thanks to 802.1X, userspace must be granted the possibility to +influence operational state. To accommodate this, operational state is +split into two parts: Two flags that can be set by the driver only, and +a RFC2863 compatible state that is derived from these flags, a policy, +and changeable from userspace under certain rules. + + +2. Querying from userspace +========================== + +Both admin and operational state can be queried via the netlink +operation RTM_GETLINK. It is also possible to subscribe to RTNLGRP_LINK +to be notified of updates while the interface is admin up. This is +important for setting from userspace. + +These values contain interface state: + +ifinfomsg::if_flags & IFF_UP: + Interface is admin up + +ifinfomsg::if_flags & IFF_RUNNING: + Interface is in RFC2863 operational state UP or UNKNOWN. This is for + backward compatibility, routing daemons, dhcp clients can use this + flag to determine whether they should use the interface. + +ifinfomsg::if_flags & IFF_LOWER_UP: + Driver has signaled netif_carrier_on() + +ifinfomsg::if_flags & IFF_DORMANT: + Driver has signaled netif_dormant_on() + +TLV IFLA_OPERSTATE +------------------ + +contains RFC2863 state of the interface in numeric representation: + +IF_OPER_UNKNOWN (0): + Interface is in unknown state, neither driver nor userspace has set + operational state. Interface must be considered for user data as + setting operational state has not been implemented in every driver. + +IF_OPER_NOTPRESENT (1): + Unused in current kernel (notpresent interfaces normally disappear), + just a numerical placeholder. + +IF_OPER_DOWN (2): + Interface is unable to transfer data on L1, f.e. ethernet is not + plugged or interface is ADMIN down. + +IF_OPER_LOWERLAYERDOWN (3): + Interfaces stacked on an interface that is IF_OPER_DOWN show this + state (f.e. VLAN). + +IF_OPER_TESTING (4): + Unused in current kernel. + +IF_OPER_DORMANT (5): + Interface is L1 up, but waiting for an external event, f.e. for a + protocol to establish. (802.1X) + +IF_OPER_UP (6): + Interface is operational up and can be used. + +This TLV can also be queried via sysfs. + +TLV IFLA_LINKMODE +----------------- + +contains link policy. This is needed for userspace interaction +described below. + +This TLV can also be queried via sysfs. + + +3. Kernel driver API +==================== + +Kernel drivers have access to two flags that map to IFF_LOWER_UP and +IFF_DORMANT. These flags can be set from everywhere, even from +interrupts. It is guaranteed that only the driver has write access, +however, if different layers of the driver manipulate the same flag, +the driver has to provide the synchronisation needed. + +__LINK_STATE_NOCARRIER, maps to !IFF_LOWER_UP: + +The driver uses netif_carrier_on() to clear and netif_carrier_off() to +set this flag. On netif_carrier_off(), the scheduler stops sending +packets. The name 'carrier' and the inversion are historical, think of +it as lower layer. + +Note that for certain kind of soft-devices, which are not managing any +real hardware, it is possible to set this bit from userspace. One +should use TVL IFLA_CARRIER to do so. + +netif_carrier_ok() can be used to query that bit. + +__LINK_STATE_DORMANT, maps to IFF_DORMANT: + +Set by the driver to express that the device cannot yet be used +because some driver controlled protocol establishment has to +complete. Corresponding functions are netif_dormant_on() to set the +flag, netif_dormant_off() to clear it and netif_dormant() to query. + +On device allocation, both flags __LINK_STATE_NOCARRIER and +__LINK_STATE_DORMANT are cleared, so the effective state is equivalent +to netif_carrier_ok() and !netif_dormant(). + + +Whenever the driver CHANGES one of these flags, a workqueue event is +scheduled to translate the flag combination to IFLA_OPERSTATE as +follows: + +!netif_carrier_ok(): + IF_OPER_LOWERLAYERDOWN if the interface is stacked, IF_OPER_DOWN + otherwise. Kernel can recognise stacked interfaces because their + ifindex != iflink. + +netif_carrier_ok() && netif_dormant(): + IF_OPER_DORMANT + +netif_carrier_ok() && !netif_dormant(): + IF_OPER_UP if userspace interaction is disabled. Otherwise + IF_OPER_DORMANT with the possibility for userspace to initiate the + IF_OPER_UP transition afterwards. + + +4. Setting from userspace +========================= + +Applications have to use the netlink interface to influence the +RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1 +via RTM_SETLINK instructs the kernel that an interface should go to +IF_OPER_DORMANT instead of IF_OPER_UP when the combination +netif_carrier_ok() && !netif_dormant() is set by the +driver. Afterwards, the userspace application can set IFLA_OPERSTATE +to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set +netif_carrier_off() or netif_dormant_on(). Changes made by userspace +are multicasted on the netlink group RTNLGRP_LINK. + +So basically a 802.1X supplicant interacts with the kernel like this: + +- subscribe to RTNLGRP_LINK +- set IFLA_LINKMODE to 1 via RTM_SETLINK +- query RTM_GETLINK once to get initial state +- if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until + netlink multicast signals this state +- do 802.1X, eventually abort if flags go down again +- send RTM_SETLINK to set operstate to IF_OPER_UP if authentication + succeeds, IF_OPER_DORMANT otherwise +- see how operstate and IFF_RUNNING is echoed via netlink multicast +- set interface back to IF_OPER_DORMANT if 802.1X reauthentication + fails +- restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag + +if supplicant goes down, bring back IFLA_LINKMODE to 0 and +IFLA_OPERSTATE to a sane value. + +A routing daemon or dhcp client just needs to care for IFF_RUNNING or +waiting for operstate to go IF_OPER_UP/IF_OPER_UNKNOWN before +considering the interface / querying a DHCP address. + + +For technical questions and/or comments please e-mail to Stefan Rompf +(stefan at loplof.de). diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.txt deleted file mode 100644 index b203d1334822..000000000000 --- a/Documentation/networking/operstates.txt +++ /dev/null @@ -1,164 +0,0 @@ - -1. Introduction - -Linux distinguishes between administrative and operational state of an -interface. Administrative state is the result of "ip link set dev - up or down" and reflects whether the administrator wants to use -the device for traffic. - -However, an interface is not usable just because the admin enabled it -- ethernet requires to be plugged into the switch and, depending on -a site's networking policy and configuration, an 802.1X authentication -to be performed before user data can be transferred. Operational state -shows the ability of an interface to transmit this user data. - -Thanks to 802.1X, userspace must be granted the possibility to -influence operational state. To accommodate this, operational state is -split into two parts: Two flags that can be set by the driver only, and -a RFC2863 compatible state that is derived from these flags, a policy, -and changeable from userspace under certain rules. - - -2. Querying from userspace - -Both admin and operational state can be queried via the netlink -operation RTM_GETLINK. It is also possible to subscribe to RTNLGRP_LINK -to be notified of updates while the interface is admin up. This is -important for setting from userspace. - -These values contain interface state: - -ifinfomsg::if_flags & IFF_UP: - Interface is admin up -ifinfomsg::if_flags & IFF_RUNNING: - Interface is in RFC2863 operational state UP or UNKNOWN. This is for - backward compatibility, routing daemons, dhcp clients can use this - flag to determine whether they should use the interface. -ifinfomsg::if_flags & IFF_LOWER_UP: - Driver has signaled netif_carrier_on() -ifinfomsg::if_flags & IFF_DORMANT: - Driver has signaled netif_dormant_on() - -TLV IFLA_OPERSTATE - -contains RFC2863 state of the interface in numeric representation: - -IF_OPER_UNKNOWN (0): - Interface is in unknown state, neither driver nor userspace has set - operational state. Interface must be considered for user data as - setting operational state has not been implemented in every driver. -IF_OPER_NOTPRESENT (1): - Unused in current kernel (notpresent interfaces normally disappear), - just a numerical placeholder. -IF_OPER_DOWN (2): - Interface is unable to transfer data on L1, f.e. ethernet is not - plugged or interface is ADMIN down. -IF_OPER_LOWERLAYERDOWN (3): - Interfaces stacked on an interface that is IF_OPER_DOWN show this - state (f.e. VLAN). -IF_OPER_TESTING (4): - Unused in current kernel. -IF_OPER_DORMANT (5): - Interface is L1 up, but waiting for an external event, f.e. for a - protocol to establish. (802.1X) -IF_OPER_UP (6): - Interface is operational up and can be used. - -This TLV can also be queried via sysfs. - -TLV IFLA_LINKMODE - -contains link policy. This is needed for userspace interaction -described below. - -This TLV can also be queried via sysfs. - - -3. Kernel driver API - -Kernel drivers have access to two flags that map to IFF_LOWER_UP and -IFF_DORMANT. These flags can be set from everywhere, even from -interrupts. It is guaranteed that only the driver has write access, -however, if different layers of the driver manipulate the same flag, -the driver has to provide the synchronisation needed. - -__LINK_STATE_NOCARRIER, maps to !IFF_LOWER_UP: - -The driver uses netif_carrier_on() to clear and netif_carrier_off() to -set this flag. On netif_carrier_off(), the scheduler stops sending -packets. The name 'carrier' and the inversion are historical, think of -it as lower layer. - -Note that for certain kind of soft-devices, which are not managing any -real hardware, it is possible to set this bit from userspace. One -should use TVL IFLA_CARRIER to do so. - -netif_carrier_ok() can be used to query that bit. - -__LINK_STATE_DORMANT, maps to IFF_DORMANT: - -Set by the driver to express that the device cannot yet be used -because some driver controlled protocol establishment has to -complete. Corresponding functions are netif_dormant_on() to set the -flag, netif_dormant_off() to clear it and netif_dormant() to query. - -On device allocation, both flags __LINK_STATE_NOCARRIER and -__LINK_STATE_DORMANT are cleared, so the effective state is equivalent -to netif_carrier_ok() and !netif_dormant(). - - -Whenever the driver CHANGES one of these flags, a workqueue event is -scheduled to translate the flag combination to IFLA_OPERSTATE as -follows: - -!netif_carrier_ok(): - IF_OPER_LOWERLAYERDOWN if the interface is stacked, IF_OPER_DOWN - otherwise. Kernel can recognise stacked interfaces because their - ifindex != iflink. - -netif_carrier_ok() && netif_dormant(): - IF_OPER_DORMANT - -netif_carrier_ok() && !netif_dormant(): - IF_OPER_UP if userspace interaction is disabled. Otherwise - IF_OPER_DORMANT with the possibility for userspace to initiate the - IF_OPER_UP transition afterwards. - - -4. Setting from userspace - -Applications have to use the netlink interface to influence the -RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1 -via RTM_SETLINK instructs the kernel that an interface should go to -IF_OPER_DORMANT instead of IF_OPER_UP when the combination -netif_carrier_ok() && !netif_dormant() is set by the -driver. Afterwards, the userspace application can set IFLA_OPERSTATE -to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set -netif_carrier_off() or netif_dormant_on(). Changes made by userspace -are multicasted on the netlink group RTNLGRP_LINK. - -So basically a 802.1X supplicant interacts with the kernel like this: - --subscribe to RTNLGRP_LINK --set IFLA_LINKMODE to 1 via RTM_SETLINK --query RTM_GETLINK once to get initial state --if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until - netlink multicast signals this state --do 802.1X, eventually abort if flags go down again --send RTM_SETLINK to set operstate to IF_OPER_UP if authentication - succeeds, IF_OPER_DORMANT otherwise --see how operstate and IFF_RUNNING is echoed via netlink multicast --set interface back to IF_OPER_DORMANT if 802.1X reauthentication - fails --restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag - -if supplicant goes down, bring back IFLA_LINKMODE to 0 and -IFLA_OPERSTATE to a sane value. - -A routing daemon or dhcp client just needs to care for IFF_RUNNING or -waiting for operstate to go IF_OPER_UP/IF_OPER_UNKNOWN before -considering the interface / querying a DHCP address. - - -For technical questions and/or comments please e-mail to Stefan Rompf -(stefan at loplof.de). -- cgit v1.2.3 From 4ba7bc9f2de6b230da2e38e6317de4b5ed91f46b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:11 +0200 Subject: docs: networking: convert packet_mmap.txt to ReST This patch has a big diff, but most are due to whitespaces. Yet, the conversion is similar to other files under networking: - add SPDX header; - add a document title; - adjust titles and chapters, adding proper markups; - mark lists as such; - mark tables as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/packet_mmap.rst | 1084 ++++++++++++++++++++++++++++++ Documentation/networking/packet_mmap.txt | 1061 ----------------------------- 3 files changed, 1085 insertions(+), 1061 deletions(-) create mode 100644 Documentation/networking/packet_mmap.rst delete mode 100644 Documentation/networking/packet_mmap.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 028a36821b9a..8262b535a83e 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -89,6 +89,7 @@ Contents: nf_flowtable openvswitch operstates + packet_mmap .. only:: subproject and html diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst new file mode 100644 index 000000000000..5f213d17652f --- /dev/null +++ b/Documentation/networking/packet_mmap.rst @@ -0,0 +1,1084 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========== +Packet MMAP +=========== + +Abstract +======== + +This file documents the mmap() facility available with the PACKET +socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for + +i) capture network traffic with utilities like tcpdump, +ii) transmit network traffic, or any other that needs raw + access to network interface. + +Howto can be found at: + + https://sites.google.com/site/packetmmap/ + +Please send your comments to + - Ulisses Alonso Camaró + - Johann Baudy + +Why use PACKET_MMAP +=================== + +In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very +inefficient. It uses very limited buffers and requires one system call to +capture each packet, it requires two if you want to get packet's timestamp +(like libpcap always does). + +In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size +configurable circular buffer mapped in user space that can be used to either +send or receive packets. This way reading packets just needs to wait for them, +most of the time there is no need to issue a single system call. Concerning +transmission, multiple packets can be sent through one system call to get the +highest bandwidth. By using a shared buffer between the kernel and the user +also has the benefit of minimizing packet copies. + +It's fine to use PACKET_MMAP to improve the performance of the capture and +transmission process, but it isn't everything. At least, if you are capturing +at high speeds (this is relative to the cpu speed), you should check if the +device driver of your network interface card supports some sort of interrupt +load mitigation or (even better) if it supports NAPI, also make sure it is +enabled. For transmission, check the MTU (Maximum Transmission Unit) used and +supported by devices of your network. CPU IRQ pinning of your network interface +card can also be an advantage. + +How to use mmap() to improve capture process +============================================ + +From the user standpoint, you should use the higher level libpcap library, which +is a de facto standard, portable across nearly all operating systems +including Win32. + +Packet MMAP support was integrated into libpcap around the time of version 1.3.0; +TPACKET_V3 support was added in version 1.5.0 + +How to use mmap() directly to improve capture process +===================================================== + +From the system calls stand point, the use of PACKET_MMAP involves +the following process:: + + + [setup] socket() -------> creation of the capture socket + setsockopt() ---> allocation of the circular buffer (ring) + option: PACKET_RX_RING + mmap() ---------> mapping of the allocated buffer to the + user process + + [capture] poll() ---------> to wait for incoming packets + + [shutdown] close() --------> destruction of the capture socket and + deallocation of all associated + resources. + + +socket creation and destruction is straight forward, and is done +the same way with or without PACKET_MMAP:: + + int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); + +where mode is SOCK_RAW for the raw interface were link level +information can be captured or SOCK_DGRAM for the cooked +interface where link level information capture is not +supported and a link level pseudo-header is provided +by the kernel. + +The destruction of the socket and all associated resources +is done by a simple call to close(fd). + +Similarly as without PACKET_MMAP, it is possible to use one socket +for capture and transmission. This can be done by mapping the +allocated RX and TX buffer ring with a single mmap() call. +See "Mapping and use of the circular buffer (ring)". + +Next I will describe PACKET_MMAP settings and its constraints, +also the mapping of the circular buffer in the user process and +the use of this buffer. + +How to use mmap() directly to improve transmission process +========================================================== +Transmission process is similar to capture as shown below:: + + [setup] socket() -------> creation of the transmission socket + setsockopt() ---> allocation of the circular buffer (ring) + option: PACKET_TX_RING + bind() ---------> bind transmission socket with a network interface + mmap() ---------> mapping of the allocated buffer to the + user process + + [transmission] poll() ---------> wait for free packets (optional) + send() ---------> send all packets that are set as ready in + the ring + The flag MSG_DONTWAIT can be used to return + before end of transfer. + + [shutdown] close() --------> destruction of the transmission socket and + deallocation of all associated resources. + +Socket creation and destruction is also straight forward, and is done +the same way as in capturing described in the previous paragraph:: + + int fd = socket(PF_PACKET, mode, 0); + +The protocol can optionally be 0 in case we only want to transmit +via this socket, which avoids an expensive call to packet_rcv(). +In this case, you also need to bind(2) the TX_RING with sll_protocol = 0 +set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example. + +Binding the socket to your network interface is mandatory (with zero copy) to +know the header size of frames used in the circular buffer. + +As capture, each frame contains two parts:: + + -------------------- + | struct tpacket_hdr | Header. It contains the status of + | | of this frame + |--------------------| + | data buffer | + . . Data that will be sent over the network interface. + . . + -------------------- + + bind() associates the socket to your network interface thanks to + sll_ifindex parameter of struct sockaddr_ll. + + Initialization example:: + + struct sockaddr_ll my_addr; + struct ifreq s_ifr; + ... + + strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name)); + + /* get interface index of eth0 */ + ioctl(this->socket, SIOCGIFINDEX, &s_ifr); + + /* fill sockaddr_ll struct to prepare binding */ + my_addr.sll_family = AF_PACKET; + my_addr.sll_protocol = htons(ETH_P_ALL); + my_addr.sll_ifindex = s_ifr.ifr_ifindex; + + /* bind socket to eth0 */ + bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)); + + A complete tutorial is available at: https://sites.google.com/site/packetmmap/ + +By default, the user should put data at:: + + frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll) + +So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW), +the beginning of the user data will be at:: + + frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + +If you wish to put user data at a custom offset from the beginning of +the frame (for payload alignment with SOCK_RAW mode for instance) you +can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order +to make this work it must be enabled previously with setsockopt() +and the PACKET_TX_HAS_OFF option. + +PACKET_MMAP settings +==================== + +To setup PACKET_MMAP from user level code is done with a call like + + - Capture process:: + + setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) + + - Transmission process:: + + setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req)) + +The most significant argument in the previous call is the req parameter, +this parameter must to have the following structure:: + + struct tpacket_req + { + unsigned int tp_block_size; /* Minimal size of contiguous block */ + unsigned int tp_block_nr; /* Number of blocks */ + unsigned int tp_frame_size; /* Size of frame */ + unsigned int tp_frame_nr; /* Total number of frames */ + }; + +This structure is defined in /usr/include/linux/if_packet.h and establishes a +circular buffer (ring) of unswappable memory. +Being mapped in the capture process allows reading the captured frames and +related meta-information like timestamps without requiring a system call. + +Frames are grouped in blocks. Each block is a physically contiguous +region of memory and holds tp_block_size/tp_frame_size frames. The total number +of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because:: + + frames_per_block = tp_block_size/tp_frame_size + +indeed, packet_set_ring checks that the following condition is true:: + + frames_per_block * tp_block_nr == tp_frame_nr + +Lets see an example, with the following values:: + + tp_block_size= 4096 + tp_frame_size= 2048 + tp_block_nr = 4 + tp_frame_nr = 8 + +we will get the following buffer structure:: + + block #1 block #2 + +---------+---------+ +---------+---------+ + | frame 1 | frame 2 | | frame 3 | frame 4 | + +---------+---------+ +---------+---------+ + + block #3 block #4 + +---------+---------+ +---------+---------+ + | frame 5 | frame 6 | | frame 7 | frame 8 | + +---------+---------+ +---------+---------+ + +A frame can be of any size with the only condition it can fit in a block. A block +can only hold an integer number of frames, or in other words, a frame cannot +be spawned across two blocks, so there are some details you have to take into +account when choosing the frame_size. See "Mapping and use of the circular +buffer (ring)". + +PACKET_MMAP setting constraints +=============================== + +In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch), +the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or +16384 in a 64 bit architecture. For information on these kernel versions +see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt + +Block size limit +---------------- + +As stated earlier, each block is a contiguous physical region of memory. These +memory regions are allocated with calls to the __get_free_pages() function. As +the name indicates, this function allocates pages of memory, and the second +argument is "order" or a power of two number of pages, that is +(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, +order=2 ==> 16384 bytes, etc. The maximum size of a +region allocated by __get_free_pages is determined by the MAX_ORDER macro. More +precisely the limit can be calculated as:: + + PAGE_SIZE << MAX_ORDER + + In a i386 architecture PAGE_SIZE is 4096 bytes + In a 2.4/i386 kernel MAX_ORDER is 10 + In a 2.6/i386 kernel MAX_ORDER is 11 + +So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel +respectively, with an i386 architecture. + +User space programs can include /usr/include/sys/user.h and +/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations. + +The pagesize can also be determined dynamically with the getpagesize (2) +system call. + +Block number limit +------------------ + +To understand the constraints of PACKET_MMAP, we have to see the structure +used to hold the pointers to each block. + +Currently, this structure is a dynamically allocated vector with kmalloc +called pg_vec, its size limits the number of blocks that can be allocated:: + + +---+---+---+---+ + | x | x | x | x | + +---+---+---+---+ + | | | | + | | | v + | | v block #4 + | v block #3 + v block #2 + block #1 + +kmalloc allocates any number of bytes of physically contiguous memory from +a pool of pre-determined sizes. This pool of memory is maintained by the slab +allocator which is at the end the responsible for doing the allocation and +hence which imposes the maximum memory that kmalloc can allocate. + +In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The +predetermined sizes that kmalloc uses can be checked in the "size-" +entries of /proc/slabinfo + +In a 32 bit architecture, pointers are 4 bytes long, so the total number of +pointers to blocks is:: + + 131072/4 = 32768 blocks + +PACKET_MMAP buffer size calculator +================================== + +Definitions: + +============== ================================================================ + is the maximum size of allocable with kmalloc + (see /proc/slabinfo) + depends on the architecture -- ``sizeof(void *)`` + depends on the architecture -- PAGE_SIZE or getpagesize (2) + is the value defined with MAX_ORDER + it's an upper bound of frame's capture size (more on this later) +============== ================================================================ + +from these definitions we will derive:: + + = / + = << + +so, the max buffer size is:: + + * + +and, the number of frames be:: + + * / + +Suppose the following parameters, which apply for 2.6 kernel and an +i386 architecture:: + + = 131072 bytes + = 4 bytes + = 4096 bytes + = 11 + +and a value for of 2048 bytes. These parameters will yield:: + + = 131072/4 = 32768 blocks + = 4096 << 11 = 8 MiB. + +and hence the buffer will have a 262144 MiB size. So it can hold +262144 MiB / 2048 bytes = 134217728 frames + +Actually, this buffer size is not possible with an i386 architecture. +Remember that the memory is allocated in kernel space, in the case of +an i386 kernel's memory size is limited to 1GiB. + +All memory allocations are not freed until the socket is closed. The memory +allocations are done with GFP_KERNEL priority, this basically means that +the allocation can wait and swap other process' memory in order to allocate +the necessary memory, so normally limits can be reached. + +Other constraints +----------------- + +If you check the source code you will see that what I draw here as a frame +is not only the link level frame. At the beginning of each frame there is a +header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame +meta information like timestamp. So what we draw here a frame it's really +the following (from include/linux/if_packet.h):: + + /* + Frame structure: + + - Start. Frame must be aligned to TPACKET_ALIGNMENT=16 + - struct tpacket_hdr + - pad to TPACKET_ALIGNMENT=16 + - struct sockaddr_ll + - Gap, chosen so that packet data (Start+tp_net) aligns to + TPACKET_ALIGNMENT=16 + - Start+tp_mac: [ Optional MAC header ] + - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. + - Pad to align to TPACKET_ALIGNMENT=16 + */ + +The following are conditions that are checked in packet_set_ring + + - tp_block_size must be a multiple of PAGE_SIZE (1) + - tp_frame_size must be greater than TPACKET_HDRLEN (obvious) + - tp_frame_size must be a multiple of TPACKET_ALIGNMENT + - tp_frame_nr must be exactly frames_per_block*tp_block_nr + +Note that tp_block_size should be chosen to be a power of two or there will +be a waste of memory. + +Mapping and use of the circular buffer (ring) +--------------------------------------------- + +The mapping of the buffer in the user process is done with the conventional +mmap function. Even the circular buffer is compound of several physically +discontiguous blocks of memory, they are contiguous to the user space, hence +just one call to mmap is needed:: + + mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + +If tp_frame_size is a divisor of tp_block_size frames will be +contiguously spaced by tp_frame_size bytes. If not, each +tp_block_size/tp_frame_size frames there will be a gap between +the frames. This is because a frame cannot be spawn across two +blocks. + +To use one socket for capture and transmission, the mapping of both the +RX and TX buffer ring has to be done with one call to mmap:: + + ... + setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo)); + setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar)); + ... + rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + tx_ring = rx_ring + size; + +RX must be the first as the kernel maps the TX ring memory right +after the RX one. + +At the beginning of each frame there is an status field (see +struct tpacket_hdr). If this field is 0 means that the frame is ready +to be used for the kernel, If not, there is a frame the user can read +and the following flags apply: + +Capture process +^^^^^^^^^^^^^^^ + + from include/linux/if_packet.h + + #define TP_STATUS_COPY (1 << 1) + #define TP_STATUS_LOSING (1 << 2) + #define TP_STATUS_CSUMNOTREADY (1 << 3) + #define TP_STATUS_CSUM_VALID (1 << 7) + +====================== ======================================================= +TP_STATUS_COPY This flag indicates that the frame (and associated + meta information) has been truncated because it's + larger than tp_frame_size. This packet can be + read entirely with recvfrom(). + + In order to make this work it must to be + enabled previously with setsockopt() and + the PACKET_COPY_THRESH option. + + The number of frames that can be buffered to + be read with recvfrom is limited like a normal socket. + See the SO_RCVBUF option in the socket (7) man page. + +TP_STATUS_LOSING indicates there were packet drops from last time + statistics where checked with getsockopt() and + the PACKET_STATISTICS option. + +TP_STATUS_CSUMNOTREADY currently it's used for outgoing IP packets which + its checksum will be done in hardware. So while + reading the packet we should not try to check the + checksum. + +TP_STATUS_CSUM_VALID This flag indicates that at least the transport + header checksum of the packet has been already + validated on the kernel side. If the flag is not set + then we are free to check the checksum by ourselves + provided that TP_STATUS_CSUMNOTREADY is also not set. +====================== ======================================================= + +for convenience there are also the following defines:: + + #define TP_STATUS_KERNEL 0 + #define TP_STATUS_USER 1 + +The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel +receives a packet it puts in the buffer and updates the status with +at least the TP_STATUS_USER flag. Then the user can read the packet, +once the packet is read the user must zero the status field, so the kernel +can use again that frame buffer. + +The user can use poll (any other variant should apply too) to check if new +packets are in the ring:: + + struct pollfd pfd; + + pfd.fd = fd; + pfd.revents = 0; + pfd.events = POLLIN|POLLRDNORM|POLLERR; + + if (status == TP_STATUS_KERNEL) + retval = poll(&pfd, 1, timeout); + +It doesn't incur in a race condition to first check the status value and +then poll for frames. + +Transmission process +^^^^^^^^^^^^^^^^^^^^ + +Those defines are also used for transmission:: + + #define TP_STATUS_AVAILABLE 0 // Frame is available + #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send() + #define TP_STATUS_SENDING 2 // Frame is currently in transmission + #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct + +First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a +packet, the user fills a data buffer of an available frame, sets tp_len to +current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST. +This can be done on multiple frames. Once the user is ready to transmit, it +calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are +forwarded to the network device. The kernel updates each status of sent +frames with TP_STATUS_SENDING until the end of transfer. + +At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE. + +:: + + header->tp_len = in_i_size; + header->tp_status = TP_STATUS_SEND_REQUEST; + retval = send(this->socket, NULL, 0, 0); + +The user can also use poll() to check if a buffer is available: + +(status == TP_STATUS_SENDING) + +:: + + struct pollfd pfd; + pfd.fd = fd; + pfd.revents = 0; + pfd.events = POLLOUT; + retval = poll(&pfd, 1, timeout); + +What TPACKET versions are available and when to use them? +========================================================= + +:: + + int val = tpacket_version; + setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); + getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); + +where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3. + +TPACKET_V1: + - Default if not otherwise specified by setsockopt(2) + - RX_RING, TX_RING available + +TPACKET_V1 --> TPACKET_V2: + - Made 64 bit clean due to unsigned long usage in TPACKET_V1 + structures, thus this also works on 64 bit kernel with 32 bit + userspace and the like + - Timestamp resolution in nanoseconds instead of microseconds + - RX_RING, TX_RING available + - VLAN metadata information available for packets + (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID), + in the tpacket2_hdr structure: + + - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates + that the tp_vlan_tci field has valid VLAN TCI value + - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field + indicates that the tp_vlan_tpid field has valid VLAN TPID value + + - How to switch to TPACKET_V2: + + 1. Replace struct tpacket_hdr by struct tpacket2_hdr + 2. Query header len and save + 3. Set protocol version to 2, set up ring as usual + 4. For getting the sockaddr_ll, + use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of + ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))`` + +TPACKET_V2 --> TPACKET_V3: + - Flexible buffer implementation for RX_RING: + 1. Blocks can be configured with non-static frame-size + 2. Read/poll is at a block-level (as opposed to packet-level) + 3. Added poll timeout to avoid indefinite user-space wait + on idle links + 4. Added user-configurable knobs: + + 4.1 block::timeout + 4.2 tpkt_hdr::sk_rxhash + + - RX Hash data available in user space + - TX_RING semantics are conceptually similar to TPACKET_V2; + use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN + instead of TPACKET2_HDRLEN. In the current implementation, + the tp_next_offset field in the tpacket3_hdr MUST be set to + zero, indicating that the ring does not hold variable sized frames. + Packets with non-zero values of tp_next_offset will be dropped. + +AF_PACKET fanout mode +===================== + +In the AF_PACKET fanout mode, packet reception can be load balanced among +processes. This also works in combination with mmap(2) on packet sockets. + +Currently implemented fanout policies are: + + - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash + - PACKET_FANOUT_LB: schedule to socket by round-robin + - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on + - PACKET_FANOUT_RND: schedule to socket by random selection + - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another + - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping + +Minimal example code by David S. Miller (try things like "./test eth0 hash", +"./test eth0 lb", etc.):: + + #include + #include + #include + #include + + #include + #include + #include + #include + + #include + + #include + #include + + #include + + static const char *device_name; + static int fanout_type; + static int fanout_id; + + #ifndef PACKET_FANOUT + # define PACKET_FANOUT 18 + # define PACKET_FANOUT_HASH 0 + # define PACKET_FANOUT_LB 1 + #endif + + static int setup_socket(void) + { + int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP)); + struct sockaddr_ll ll; + struct ifreq ifr; + int fanout_arg; + + if (fd < 0) { + perror("socket"); + return EXIT_FAILURE; + } + + memset(&ifr, 0, sizeof(ifr)); + strcpy(ifr.ifr_name, device_name); + err = ioctl(fd, SIOCGIFINDEX, &ifr); + if (err < 0) { + perror("SIOCGIFINDEX"); + return EXIT_FAILURE; + } + + memset(&ll, 0, sizeof(ll)); + ll.sll_family = AF_PACKET; + ll.sll_ifindex = ifr.ifr_ifindex; + err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); + if (err < 0) { + perror("bind"); + return EXIT_FAILURE; + } + + fanout_arg = (fanout_id | (fanout_type << 16)); + err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT, + &fanout_arg, sizeof(fanout_arg)); + if (err) { + perror("setsockopt"); + return EXIT_FAILURE; + } + + return fd; + } + + static void fanout_thread(void) + { + int fd = setup_socket(); + int limit = 10000; + + if (fd < 0) + exit(fd); + + while (limit-- > 0) { + char buf[1600]; + int err; + + err = read(fd, buf, sizeof(buf)); + if (err < 0) { + perror("read"); + exit(EXIT_FAILURE); + } + if ((limit % 10) == 0) + fprintf(stdout, "(%d) \n", getpid()); + } + + fprintf(stdout, "%d: Received 10000 packets\n", getpid()); + + close(fd); + exit(0); + } + + int main(int argc, char **argp) + { + int fd, err; + int i; + + if (argc != 3) { + fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]); + return EXIT_FAILURE; + } + + if (!strcmp(argp[2], "hash")) + fanout_type = PACKET_FANOUT_HASH; + else if (!strcmp(argp[2], "lb")) + fanout_type = PACKET_FANOUT_LB; + else { + fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]); + exit(EXIT_FAILURE); + } + + device_name = argp[1]; + fanout_id = getpid() & 0xffff; + + for (i = 0; i < 4; i++) { + pid_t pid = fork(); + + switch (pid) { + case 0: + fanout_thread(); + + case -1: + perror("fork"); + exit(EXIT_FAILURE); + } + } + + for (i = 0; i < 4; i++) { + int status; + + wait(&status); + } + + return 0; + } + +AF_PACKET TPACKET_V3 example +============================ + +AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame +sizes by doing it's own memory management. It is based on blocks where polling +works on a per block basis instead of per ring as in TPACKET_V2 and predecessor. + +It is said that TPACKET_V3 brings the following benefits: + + * ~15% - 20% reduction in CPU-usage + * ~20% increase in packet capture rate + * ~2x increase in packet density + * Port aggregation analysis + * Non static frame size to capture entire packet payload + +So it seems to be a good candidate to be used with packet fanout. + +Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile +it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):: + + /* Written from scratch, but kernel-to-user space API usage + * dissected from lolpcap: + * Copyright 2011, Chetan Loke + * License: GPL, version 2.0 + */ + + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + + #ifndef likely + # define likely(x) __builtin_expect(!!(x), 1) + #endif + #ifndef unlikely + # define unlikely(x) __builtin_expect(!!(x), 0) + #endif + + struct block_desc { + uint32_t version; + uint32_t offset_to_priv; + struct tpacket_hdr_v1 h1; + }; + + struct ring { + struct iovec *rd; + uint8_t *map; + struct tpacket_req3 req; + }; + + static unsigned long packets_total = 0, bytes_total = 0; + static sig_atomic_t sigint = 0; + + static void sighandler(int num) + { + sigint = 1; + } + + static int setup_socket(struct ring *ring, char *netdev) + { + int err, i, fd, v = TPACKET_V3; + struct sockaddr_ll ll; + unsigned int blocksiz = 1 << 22, framesiz = 1 << 11; + unsigned int blocknum = 64; + + fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (fd < 0) { + perror("socket"); + exit(1); + } + + err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v)); + if (err < 0) { + perror("setsockopt"); + exit(1); + } + + memset(&ring->req, 0, sizeof(ring->req)); + ring->req.tp_block_size = blocksiz; + ring->req.tp_frame_size = framesiz; + ring->req.tp_block_nr = blocknum; + ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz; + ring->req.tp_retire_blk_tov = 60; + ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH; + + err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req, + sizeof(ring->req)); + if (err < 0) { + perror("setsockopt"); + exit(1); + } + + ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr, + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0); + if (ring->map == MAP_FAILED) { + perror("mmap"); + exit(1); + } + + ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd)); + assert(ring->rd); + for (i = 0; i < ring->req.tp_block_nr; ++i) { + ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size); + ring->rd[i].iov_len = ring->req.tp_block_size; + } + + memset(&ll, 0, sizeof(ll)); + ll.sll_family = PF_PACKET; + ll.sll_protocol = htons(ETH_P_ALL); + ll.sll_ifindex = if_nametoindex(netdev); + ll.sll_hatype = 0; + ll.sll_pkttype = 0; + ll.sll_halen = 0; + + err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); + if (err < 0) { + perror("bind"); + exit(1); + } + + return fd; + } + + static void display(struct tpacket3_hdr *ppd) + { + struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); + struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); + + if (eth->h_proto == htons(ETH_P_IP)) { + struct sockaddr_in ss, sd; + char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; + + memset(&ss, 0, sizeof(ss)); + ss.sin_family = PF_INET; + ss.sin_addr.s_addr = ip->saddr; + getnameinfo((struct sockaddr *) &ss, sizeof(ss), + sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); + + memset(&sd, 0, sizeof(sd)); + sd.sin_family = PF_INET; + sd.sin_addr.s_addr = ip->daddr; + getnameinfo((struct sockaddr *) &sd, sizeof(sd), + dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); + + printf("%s -> %s, ", sbuff, dbuff); + } + + printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); + } + + static void walk_block(struct block_desc *pbd, const int block_num) + { + int num_pkts = pbd->h1.num_pkts, i; + unsigned long bytes = 0; + struct tpacket3_hdr *ppd; + + ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + + pbd->h1.offset_to_first_pkt); + for (i = 0; i < num_pkts; ++i) { + bytes += ppd->tp_snaplen; + display(ppd); + + ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + + ppd->tp_next_offset); + } + + packets_total += num_pkts; + bytes_total += bytes; + } + + static void flush_block(struct block_desc *pbd) + { + pbd->h1.block_status = TP_STATUS_KERNEL; + } + + static void teardown_socket(struct ring *ring, int fd) + { + munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); + free(ring->rd); + close(fd); + } + + int main(int argc, char **argp) + { + int fd, err; + socklen_t len; + struct ring ring; + struct pollfd pfd; + unsigned int block_num = 0, blocks = 64; + struct block_desc *pbd; + struct tpacket_stats_v3 stats; + + if (argc != 2) { + fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); + return EXIT_FAILURE; + } + + signal(SIGINT, sighandler); + + memset(&ring, 0, sizeof(ring)); + fd = setup_socket(&ring, argp[argc - 1]); + assert(fd > 0); + + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = fd; + pfd.events = POLLIN | POLLERR; + pfd.revents = 0; + + while (likely(!sigint)) { + pbd = (struct block_desc *) ring.rd[block_num].iov_base; + + if ((pbd->h1.block_status & TP_STATUS_USER) == 0) { + poll(&pfd, 1, -1); + continue; + } + + walk_block(pbd, block_num); + flush_block(pbd); + block_num = (block_num + 1) % blocks; + } + + len = sizeof(stats); + err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); + if (err < 0) { + perror("getsockopt"); + exit(1); + } + + fflush(stdout); + printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", + stats.tp_packets, bytes_total, stats.tp_drops, + stats.tp_freeze_q_cnt); + + teardown_socket(&ring, fd); + return 0; + } + +PACKET_QDISC_BYPASS +=================== + +If there is a requirement to load the network with many packets in a similar +fashion as pktgen does, you might set the following option after socket +creation:: + + int one = 1; + setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one)); + +This has the side-effect, that packets sent through PF_PACKET will bypass the +kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning, +packet are not buffered, tc disciplines are ignored, increased loss can occur +and such packets are also not visible to other PF_PACKET sockets anymore. So, +you have been warned; generally, this can be useful for stress testing various +components of a system. + +On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled +on PF_PACKET sockets. + +PACKET_TIMESTAMP +================ + +The PACKET_TIMESTAMP setting determines the source of the timestamp in +the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your +NIC is capable of timestamping packets in hardware, you can request those +hardware timestamps to be used. Note: you may need to enable the generation +of hardware timestamps with SIOCSHWTSTAMP (see related information from +Documentation/networking/timestamping.txt). + +PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:: + + int req = SOF_TIMESTAMPING_RAW_HARDWARE; + setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)) + +For the mmap(2)ed ring buffers, such timestamps are stored in the +``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members. +To determine what kind of timestamp has been reported, the tp_status field +is binary or'ed with the following possible bits ... + +:: + + TP_STATUS_TS_RAW_HARDWARE + TP_STATUS_TS_SOFTWARE + +... that are equivalent to its ``SOF_TIMESTAMPING_*`` counterparts. For the +RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a +software fallback was invoked *within* PF_PACKET's processing code (less +precise). + +Getting timestamps for the TX_RING works as follows: i) fill the ring frames, +ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant +frames to be updated resp. the frame handed over to the application, iv) walk +through the frames to pick up the individual hw/sw timestamps. + +Only (!) if transmit timestamping is enabled, then these bits are combined +with binary | with TP_STATUS_AVAILABLE, so you must check for that in your +application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) +in a first step to see if the frame belongs to the application, and then +one can extract the type of timestamp in a second step from tp_status)! + +If you don't care about them, thus having it disabled, checking for +TP_STATUS_AVAILABLE resp. TP_STATUS_WRONG_FORMAT is sufficient. If in the +TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec +members do not contain a valid value. For TX_RINGs, by default no timestamp +is generated! + +See include/linux/net_tstamp.h and Documentation/networking/timestamping.txt +for more information on hardware timestamps. + +Miscellaneous bits +================== + +- Packet sockets work well together with Linux socket filters, thus you also + might want to have a look at Documentation/networking/filter.txt + +THANKS +====== + + Jesse Brandeburg, for fixing my grammathical/spelling errors diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt deleted file mode 100644 index 494614573c67..000000000000 --- a/Documentation/networking/packet_mmap.txt +++ /dev/null @@ -1,1061 +0,0 @@ --------------------------------------------------------------------------------- -+ ABSTRACT --------------------------------------------------------------------------------- - -This file documents the mmap() facility available with the PACKET -socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for -i) capture network traffic with utilities like tcpdump, ii) transmit network -traffic, or any other that needs raw access to network interface. - -Howto can be found at: - https://sites.google.com/site/packetmmap/ - -Please send your comments to - Ulisses Alonso Camaró - Johann Baudy - -------------------------------------------------------------------------------- -+ Why use PACKET_MMAP --------------------------------------------------------------------------------- - -In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very -inefficient. It uses very limited buffers and requires one system call to -capture each packet, it requires two if you want to get packet's timestamp -(like libpcap always does). - -In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size -configurable circular buffer mapped in user space that can be used to either -send or receive packets. This way reading packets just needs to wait for them, -most of the time there is no need to issue a single system call. Concerning -transmission, multiple packets can be sent through one system call to get the -highest bandwidth. By using a shared buffer between the kernel and the user -also has the benefit of minimizing packet copies. - -It's fine to use PACKET_MMAP to improve the performance of the capture and -transmission process, but it isn't everything. At least, if you are capturing -at high speeds (this is relative to the cpu speed), you should check if the -device driver of your network interface card supports some sort of interrupt -load mitigation or (even better) if it supports NAPI, also make sure it is -enabled. For transmission, check the MTU (Maximum Transmission Unit) used and -supported by devices of your network. CPU IRQ pinning of your network interface -card can also be an advantage. - --------------------------------------------------------------------------------- -+ How to use mmap() to improve capture process --------------------------------------------------------------------------------- - -From the user standpoint, you should use the higher level libpcap library, which -is a de facto standard, portable across nearly all operating systems -including Win32. - -Packet MMAP support was integrated into libpcap around the time of version 1.3.0; -TPACKET_V3 support was added in version 1.5.0 - --------------------------------------------------------------------------------- -+ How to use mmap() directly to improve capture process --------------------------------------------------------------------------------- - -From the system calls stand point, the use of PACKET_MMAP involves -the following process: - - -[setup] socket() -------> creation of the capture socket - setsockopt() ---> allocation of the circular buffer (ring) - option: PACKET_RX_RING - mmap() ---------> mapping of the allocated buffer to the - user process - -[capture] poll() ---------> to wait for incoming packets - -[shutdown] close() --------> destruction of the capture socket and - deallocation of all associated - resources. - - -socket creation and destruction is straight forward, and is done -the same way with or without PACKET_MMAP: - - int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); - -where mode is SOCK_RAW for the raw interface were link level -information can be captured or SOCK_DGRAM for the cooked -interface where link level information capture is not -supported and a link level pseudo-header is provided -by the kernel. - -The destruction of the socket and all associated resources -is done by a simple call to close(fd). - -Similarly as without PACKET_MMAP, it is possible to use one socket -for capture and transmission. This can be done by mapping the -allocated RX and TX buffer ring with a single mmap() call. -See "Mapping and use of the circular buffer (ring)". - -Next I will describe PACKET_MMAP settings and its constraints, -also the mapping of the circular buffer in the user process and -the use of this buffer. - --------------------------------------------------------------------------------- -+ How to use mmap() directly to improve transmission process --------------------------------------------------------------------------------- -Transmission process is similar to capture as shown below. - -[setup] socket() -------> creation of the transmission socket - setsockopt() ---> allocation of the circular buffer (ring) - option: PACKET_TX_RING - bind() ---------> bind transmission socket with a network interface - mmap() ---------> mapping of the allocated buffer to the - user process - -[transmission] poll() ---------> wait for free packets (optional) - send() ---------> send all packets that are set as ready in - the ring - The flag MSG_DONTWAIT can be used to return - before end of transfer. - -[shutdown] close() --------> destruction of the transmission socket and - deallocation of all associated resources. - -Socket creation and destruction is also straight forward, and is done -the same way as in capturing described in the previous paragraph: - - int fd = socket(PF_PACKET, mode, 0); - -The protocol can optionally be 0 in case we only want to transmit -via this socket, which avoids an expensive call to packet_rcv(). -In this case, you also need to bind(2) the TX_RING with sll_protocol = 0 -set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example. - -Binding the socket to your network interface is mandatory (with zero copy) to -know the header size of frames used in the circular buffer. - -As capture, each frame contains two parts: - - -------------------- -| struct tpacket_hdr | Header. It contains the status of -| | of this frame -|--------------------| -| data buffer | -. . Data that will be sent over the network interface. -. . - -------------------- - - bind() associates the socket to your network interface thanks to - sll_ifindex parameter of struct sockaddr_ll. - - Initialization example: - - struct sockaddr_ll my_addr; - struct ifreq s_ifr; - ... - - strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name)); - - /* get interface index of eth0 */ - ioctl(this->socket, SIOCGIFINDEX, &s_ifr); - - /* fill sockaddr_ll struct to prepare binding */ - my_addr.sll_family = AF_PACKET; - my_addr.sll_protocol = htons(ETH_P_ALL); - my_addr.sll_ifindex = s_ifr.ifr_ifindex; - - /* bind socket to eth0 */ - bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)); - - A complete tutorial is available at: https://sites.google.com/site/packetmmap/ - -By default, the user should put data at : - frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll) - -So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW), -the beginning of the user data will be at : - frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) - -If you wish to put user data at a custom offset from the beginning of -the frame (for payload alignment with SOCK_RAW mode for instance) you -can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order -to make this work it must be enabled previously with setsockopt() -and the PACKET_TX_HAS_OFF option. - --------------------------------------------------------------------------------- -+ PACKET_MMAP settings --------------------------------------------------------------------------------- - -To setup PACKET_MMAP from user level code is done with a call like - - - Capture process - setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) - - Transmission process - setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req)) - -The most significant argument in the previous call is the req parameter, -this parameter must to have the following structure: - - struct tpacket_req - { - unsigned int tp_block_size; /* Minimal size of contiguous block */ - unsigned int tp_block_nr; /* Number of blocks */ - unsigned int tp_frame_size; /* Size of frame */ - unsigned int tp_frame_nr; /* Total number of frames */ - }; - -This structure is defined in /usr/include/linux/if_packet.h and establishes a -circular buffer (ring) of unswappable memory. -Being mapped in the capture process allows reading the captured frames and -related meta-information like timestamps without requiring a system call. - -Frames are grouped in blocks. Each block is a physically contiguous -region of memory and holds tp_block_size/tp_frame_size frames. The total number -of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because - - frames_per_block = tp_block_size/tp_frame_size - -indeed, packet_set_ring checks that the following condition is true - - frames_per_block * tp_block_nr == tp_frame_nr - -Lets see an example, with the following values: - - tp_block_size= 4096 - tp_frame_size= 2048 - tp_block_nr = 4 - tp_frame_nr = 8 - -we will get the following buffer structure: - - block #1 block #2 -+---------+---------+ +---------+---------+ -| frame 1 | frame 2 | | frame 3 | frame 4 | -+---------+---------+ +---------+---------+ - - block #3 block #4 -+---------+---------+ +---------+---------+ -| frame 5 | frame 6 | | frame 7 | frame 8 | -+---------+---------+ +---------+---------+ - -A frame can be of any size with the only condition it can fit in a block. A block -can only hold an integer number of frames, or in other words, a frame cannot -be spawned across two blocks, so there are some details you have to take into -account when choosing the frame_size. See "Mapping and use of the circular -buffer (ring)". - --------------------------------------------------------------------------------- -+ PACKET_MMAP setting constraints --------------------------------------------------------------------------------- - -In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch), -the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or -16384 in a 64 bit architecture. For information on these kernel versions -see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt - - Block size limit ------------------- - -As stated earlier, each block is a contiguous physical region of memory. These -memory regions are allocated with calls to the __get_free_pages() function. As -the name indicates, this function allocates pages of memory, and the second -argument is "order" or a power of two number of pages, that is -(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, -order=2 ==> 16384 bytes, etc. The maximum size of a -region allocated by __get_free_pages is determined by the MAX_ORDER macro. More -precisely the limit can be calculated as: - - PAGE_SIZE << MAX_ORDER - - In a i386 architecture PAGE_SIZE is 4096 bytes - In a 2.4/i386 kernel MAX_ORDER is 10 - In a 2.6/i386 kernel MAX_ORDER is 11 - -So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel -respectively, with an i386 architecture. - -User space programs can include /usr/include/sys/user.h and -/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations. - -The pagesize can also be determined dynamically with the getpagesize (2) -system call. - - Block number limit --------------------- - -To understand the constraints of PACKET_MMAP, we have to see the structure -used to hold the pointers to each block. - -Currently, this structure is a dynamically allocated vector with kmalloc -called pg_vec, its size limits the number of blocks that can be allocated. - - +---+---+---+---+ - | x | x | x | x | - +---+---+---+---+ - | | | | - | | | v - | | v block #4 - | v block #3 - v block #2 - block #1 - -kmalloc allocates any number of bytes of physically contiguous memory from -a pool of pre-determined sizes. This pool of memory is maintained by the slab -allocator which is at the end the responsible for doing the allocation and -hence which imposes the maximum memory that kmalloc can allocate. - -In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The -predetermined sizes that kmalloc uses can be checked in the "size-" -entries of /proc/slabinfo - -In a 32 bit architecture, pointers are 4 bytes long, so the total number of -pointers to blocks is - - 131072/4 = 32768 blocks - - PACKET_MMAP buffer size calculator ------------------------------------- - -Definitions: - - : is the maximum size of allocable with kmalloc (see /proc/slabinfo) -: depends on the architecture -- sizeof(void *) - : depends on the architecture -- PAGE_SIZE or getpagesize (2) - : is the value defined with MAX_ORDER - : it's an upper bound of frame's capture size (more on this later) - -from these definitions we will derive - - = / - = << - -so, the max buffer size is - - * - -and, the number of frames be - - * / - -Suppose the following parameters, which apply for 2.6 kernel and an -i386 architecture: - - = 131072 bytes - = 4 bytes - = 4096 bytes - = 11 - -and a value for of 2048 bytes. These parameters will yield - - = 131072/4 = 32768 blocks - = 4096 << 11 = 8 MiB. - -and hence the buffer will have a 262144 MiB size. So it can hold -262144 MiB / 2048 bytes = 134217728 frames - -Actually, this buffer size is not possible with an i386 architecture. -Remember that the memory is allocated in kernel space, in the case of -an i386 kernel's memory size is limited to 1GiB. - -All memory allocations are not freed until the socket is closed. The memory -allocations are done with GFP_KERNEL priority, this basically means that -the allocation can wait and swap other process' memory in order to allocate -the necessary memory, so normally limits can be reached. - - Other constraints -------------------- - -If you check the source code you will see that what I draw here as a frame -is not only the link level frame. At the beginning of each frame there is a -header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame -meta information like timestamp. So what we draw here a frame it's really -the following (from include/linux/if_packet.h): - -/* - Frame structure: - - - Start. Frame must be aligned to TPACKET_ALIGNMENT=16 - - struct tpacket_hdr - - pad to TPACKET_ALIGNMENT=16 - - struct sockaddr_ll - - Gap, chosen so that packet data (Start+tp_net) aligns to - TPACKET_ALIGNMENT=16 - - Start+tp_mac: [ Optional MAC header ] - - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. - - Pad to align to TPACKET_ALIGNMENT=16 - */ - - The following are conditions that are checked in packet_set_ring - - tp_block_size must be a multiple of PAGE_SIZE (1) - tp_frame_size must be greater than TPACKET_HDRLEN (obvious) - tp_frame_size must be a multiple of TPACKET_ALIGNMENT - tp_frame_nr must be exactly frames_per_block*tp_block_nr - -Note that tp_block_size should be chosen to be a power of two or there will -be a waste of memory. - --------------------------------------------------------------------------------- -+ Mapping and use of the circular buffer (ring) --------------------------------------------------------------------------------- - -The mapping of the buffer in the user process is done with the conventional -mmap function. Even the circular buffer is compound of several physically -discontiguous blocks of memory, they are contiguous to the user space, hence -just one call to mmap is needed: - - mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); - -If tp_frame_size is a divisor of tp_block_size frames will be -contiguously spaced by tp_frame_size bytes. If not, each -tp_block_size/tp_frame_size frames there will be a gap between -the frames. This is because a frame cannot be spawn across two -blocks. - -To use one socket for capture and transmission, the mapping of both the -RX and TX buffer ring has to be done with one call to mmap: - - ... - setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo)); - setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar)); - ... - rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); - tx_ring = rx_ring + size; - -RX must be the first as the kernel maps the TX ring memory right -after the RX one. - -At the beginning of each frame there is an status field (see -struct tpacket_hdr). If this field is 0 means that the frame is ready -to be used for the kernel, If not, there is a frame the user can read -and the following flags apply: - -+++ Capture process: - from include/linux/if_packet.h - - #define TP_STATUS_COPY (1 << 1) - #define TP_STATUS_LOSING (1 << 2) - #define TP_STATUS_CSUMNOTREADY (1 << 3) - #define TP_STATUS_CSUM_VALID (1 << 7) - -TP_STATUS_COPY : This flag indicates that the frame (and associated - meta information) has been truncated because it's - larger than tp_frame_size. This packet can be - read entirely with recvfrom(). - - In order to make this work it must to be - enabled previously with setsockopt() and - the PACKET_COPY_THRESH option. - - The number of frames that can be buffered to - be read with recvfrom is limited like a normal socket. - See the SO_RCVBUF option in the socket (7) man page. - -TP_STATUS_LOSING : indicates there were packet drops from last time - statistics where checked with getsockopt() and - the PACKET_STATISTICS option. - -TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which - its checksum will be done in hardware. So while - reading the packet we should not try to check the - checksum. - -TP_STATUS_CSUM_VALID : This flag indicates that at least the transport - header checksum of the packet has been already - validated on the kernel side. If the flag is not set - then we are free to check the checksum by ourselves - provided that TP_STATUS_CSUMNOTREADY is also not set. - -for convenience there are also the following defines: - - #define TP_STATUS_KERNEL 0 - #define TP_STATUS_USER 1 - -The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel -receives a packet it puts in the buffer and updates the status with -at least the TP_STATUS_USER flag. Then the user can read the packet, -once the packet is read the user must zero the status field, so the kernel -can use again that frame buffer. - -The user can use poll (any other variant should apply too) to check if new -packets are in the ring: - - struct pollfd pfd; - - pfd.fd = fd; - pfd.revents = 0; - pfd.events = POLLIN|POLLRDNORM|POLLERR; - - if (status == TP_STATUS_KERNEL) - retval = poll(&pfd, 1, timeout); - -It doesn't incur in a race condition to first check the status value and -then poll for frames. - -++ Transmission process -Those defines are also used for transmission: - - #define TP_STATUS_AVAILABLE 0 // Frame is available - #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send() - #define TP_STATUS_SENDING 2 // Frame is currently in transmission - #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct - -First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a -packet, the user fills a data buffer of an available frame, sets tp_len to -current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST. -This can be done on multiple frames. Once the user is ready to transmit, it -calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are -forwarded to the network device. The kernel updates each status of sent -frames with TP_STATUS_SENDING until the end of transfer. -At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE. - - header->tp_len = in_i_size; - header->tp_status = TP_STATUS_SEND_REQUEST; - retval = send(this->socket, NULL, 0, 0); - -The user can also use poll() to check if a buffer is available: -(status == TP_STATUS_SENDING) - - struct pollfd pfd; - pfd.fd = fd; - pfd.revents = 0; - pfd.events = POLLOUT; - retval = poll(&pfd, 1, timeout); - -------------------------------------------------------------------------------- -+ What TPACKET versions are available and when to use them? -------------------------------------------------------------------------------- - - int val = tpacket_version; - setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); - getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); - -where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3. - -TPACKET_V1: - - Default if not otherwise specified by setsockopt(2) - - RX_RING, TX_RING available - -TPACKET_V1 --> TPACKET_V2: - - Made 64 bit clean due to unsigned long usage in TPACKET_V1 - structures, thus this also works on 64 bit kernel with 32 bit - userspace and the like - - Timestamp resolution in nanoseconds instead of microseconds - - RX_RING, TX_RING available - - VLAN metadata information available for packets - (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID), - in the tpacket2_hdr structure: - - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates - that the tp_vlan_tci field has valid VLAN TCI value - - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field - indicates that the tp_vlan_tpid field has valid VLAN TPID value - - How to switch to TPACKET_V2: - 1. Replace struct tpacket_hdr by struct tpacket2_hdr - 2. Query header len and save - 3. Set protocol version to 2, set up ring as usual - 4. For getting the sockaddr_ll, - use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of - (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) - -TPACKET_V2 --> TPACKET_V3: - - Flexible buffer implementation for RX_RING: - 1. Blocks can be configured with non-static frame-size - 2. Read/poll is at a block-level (as opposed to packet-level) - 3. Added poll timeout to avoid indefinite user-space wait - on idle links - 4. Added user-configurable knobs: - 4.1 block::timeout - 4.2 tpkt_hdr::sk_rxhash - - RX Hash data available in user space - - TX_RING semantics are conceptually similar to TPACKET_V2; - use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN - instead of TPACKET2_HDRLEN. In the current implementation, - the tp_next_offset field in the tpacket3_hdr MUST be set to - zero, indicating that the ring does not hold variable sized frames. - Packets with non-zero values of tp_next_offset will be dropped. - -------------------------------------------------------------------------------- -+ AF_PACKET fanout mode -------------------------------------------------------------------------------- - -In the AF_PACKET fanout mode, packet reception can be load balanced among -processes. This also works in combination with mmap(2) on packet sockets. - -Currently implemented fanout policies are: - - - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash - - PACKET_FANOUT_LB: schedule to socket by round-robin - - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on - - PACKET_FANOUT_RND: schedule to socket by random selection - - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another - - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping - -Minimal example code by David S. Miller (try things like "./test eth0 hash", -"./test eth0 lb", etc.): - -#include -#include -#include -#include - -#include -#include -#include -#include - -#include - -#include -#include - -#include - -static const char *device_name; -static int fanout_type; -static int fanout_id; - -#ifndef PACKET_FANOUT -# define PACKET_FANOUT 18 -# define PACKET_FANOUT_HASH 0 -# define PACKET_FANOUT_LB 1 -#endif - -static int setup_socket(void) -{ - int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP)); - struct sockaddr_ll ll; - struct ifreq ifr; - int fanout_arg; - - if (fd < 0) { - perror("socket"); - return EXIT_FAILURE; - } - - memset(&ifr, 0, sizeof(ifr)); - strcpy(ifr.ifr_name, device_name); - err = ioctl(fd, SIOCGIFINDEX, &ifr); - if (err < 0) { - perror("SIOCGIFINDEX"); - return EXIT_FAILURE; - } - - memset(&ll, 0, sizeof(ll)); - ll.sll_family = AF_PACKET; - ll.sll_ifindex = ifr.ifr_ifindex; - err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); - if (err < 0) { - perror("bind"); - return EXIT_FAILURE; - } - - fanout_arg = (fanout_id | (fanout_type << 16)); - err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT, - &fanout_arg, sizeof(fanout_arg)); - if (err) { - perror("setsockopt"); - return EXIT_FAILURE; - } - - return fd; -} - -static void fanout_thread(void) -{ - int fd = setup_socket(); - int limit = 10000; - - if (fd < 0) - exit(fd); - - while (limit-- > 0) { - char buf[1600]; - int err; - - err = read(fd, buf, sizeof(buf)); - if (err < 0) { - perror("read"); - exit(EXIT_FAILURE); - } - if ((limit % 10) == 0) - fprintf(stdout, "(%d) \n", getpid()); - } - - fprintf(stdout, "%d: Received 10000 packets\n", getpid()); - - close(fd); - exit(0); -} - -int main(int argc, char **argp) -{ - int fd, err; - int i; - - if (argc != 3) { - fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]); - return EXIT_FAILURE; - } - - if (!strcmp(argp[2], "hash")) - fanout_type = PACKET_FANOUT_HASH; - else if (!strcmp(argp[2], "lb")) - fanout_type = PACKET_FANOUT_LB; - else { - fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]); - exit(EXIT_FAILURE); - } - - device_name = argp[1]; - fanout_id = getpid() & 0xffff; - - for (i = 0; i < 4; i++) { - pid_t pid = fork(); - - switch (pid) { - case 0: - fanout_thread(); - - case -1: - perror("fork"); - exit(EXIT_FAILURE); - } - } - - for (i = 0; i < 4; i++) { - int status; - - wait(&status); - } - - return 0; -} - -------------------------------------------------------------------------------- -+ AF_PACKET TPACKET_V3 example -------------------------------------------------------------------------------- - -AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame -sizes by doing it's own memory management. It is based on blocks where polling -works on a per block basis instead of per ring as in TPACKET_V2 and predecessor. - -It is said that TPACKET_V3 brings the following benefits: - *) ~15 - 20% reduction in CPU-usage - *) ~20% increase in packet capture rate - *) ~2x increase in packet density - *) Port aggregation analysis - *) Non static frame size to capture entire packet payload - -So it seems to be a good candidate to be used with packet fanout. - -Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile -it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.): - -/* Written from scratch, but kernel-to-user space API usage - * dissected from lolpcap: - * Copyright 2011, Chetan Loke - * License: GPL, version 2.0 - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#ifndef likely -# define likely(x) __builtin_expect(!!(x), 1) -#endif -#ifndef unlikely -# define unlikely(x) __builtin_expect(!!(x), 0) -#endif - -struct block_desc { - uint32_t version; - uint32_t offset_to_priv; - struct tpacket_hdr_v1 h1; -}; - -struct ring { - struct iovec *rd; - uint8_t *map; - struct tpacket_req3 req; -}; - -static unsigned long packets_total = 0, bytes_total = 0; -static sig_atomic_t sigint = 0; - -static void sighandler(int num) -{ - sigint = 1; -} - -static int setup_socket(struct ring *ring, char *netdev) -{ - int err, i, fd, v = TPACKET_V3; - struct sockaddr_ll ll; - unsigned int blocksiz = 1 << 22, framesiz = 1 << 11; - unsigned int blocknum = 64; - - fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); - if (fd < 0) { - perror("socket"); - exit(1); - } - - err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v)); - if (err < 0) { - perror("setsockopt"); - exit(1); - } - - memset(&ring->req, 0, sizeof(ring->req)); - ring->req.tp_block_size = blocksiz; - ring->req.tp_frame_size = framesiz; - ring->req.tp_block_nr = blocknum; - ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz; - ring->req.tp_retire_blk_tov = 60; - ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH; - - err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req, - sizeof(ring->req)); - if (err < 0) { - perror("setsockopt"); - exit(1); - } - - ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr, - PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0); - if (ring->map == MAP_FAILED) { - perror("mmap"); - exit(1); - } - - ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd)); - assert(ring->rd); - for (i = 0; i < ring->req.tp_block_nr; ++i) { - ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size); - ring->rd[i].iov_len = ring->req.tp_block_size; - } - - memset(&ll, 0, sizeof(ll)); - ll.sll_family = PF_PACKET; - ll.sll_protocol = htons(ETH_P_ALL); - ll.sll_ifindex = if_nametoindex(netdev); - ll.sll_hatype = 0; - ll.sll_pkttype = 0; - ll.sll_halen = 0; - - err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); - if (err < 0) { - perror("bind"); - exit(1); - } - - return fd; -} - -static void display(struct tpacket3_hdr *ppd) -{ - struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); - struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); - - if (eth->h_proto == htons(ETH_P_IP)) { - struct sockaddr_in ss, sd; - char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; - - memset(&ss, 0, sizeof(ss)); - ss.sin_family = PF_INET; - ss.sin_addr.s_addr = ip->saddr; - getnameinfo((struct sockaddr *) &ss, sizeof(ss), - sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); - - memset(&sd, 0, sizeof(sd)); - sd.sin_family = PF_INET; - sd.sin_addr.s_addr = ip->daddr; - getnameinfo((struct sockaddr *) &sd, sizeof(sd), - dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); - - printf("%s -> %s, ", sbuff, dbuff); - } - - printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); -} - -static void walk_block(struct block_desc *pbd, const int block_num) -{ - int num_pkts = pbd->h1.num_pkts, i; - unsigned long bytes = 0; - struct tpacket3_hdr *ppd; - - ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + - pbd->h1.offset_to_first_pkt); - for (i = 0; i < num_pkts; ++i) { - bytes += ppd->tp_snaplen; - display(ppd); - - ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + - ppd->tp_next_offset); - } - - packets_total += num_pkts; - bytes_total += bytes; -} - -static void flush_block(struct block_desc *pbd) -{ - pbd->h1.block_status = TP_STATUS_KERNEL; -} - -static void teardown_socket(struct ring *ring, int fd) -{ - munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); - free(ring->rd); - close(fd); -} - -int main(int argc, char **argp) -{ - int fd, err; - socklen_t len; - struct ring ring; - struct pollfd pfd; - unsigned int block_num = 0, blocks = 64; - struct block_desc *pbd; - struct tpacket_stats_v3 stats; - - if (argc != 2) { - fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); - return EXIT_FAILURE; - } - - signal(SIGINT, sighandler); - - memset(&ring, 0, sizeof(ring)); - fd = setup_socket(&ring, argp[argc - 1]); - assert(fd > 0); - - memset(&pfd, 0, sizeof(pfd)); - pfd.fd = fd; - pfd.events = POLLIN | POLLERR; - pfd.revents = 0; - - while (likely(!sigint)) { - pbd = (struct block_desc *) ring.rd[block_num].iov_base; - - if ((pbd->h1.block_status & TP_STATUS_USER) == 0) { - poll(&pfd, 1, -1); - continue; - } - - walk_block(pbd, block_num); - flush_block(pbd); - block_num = (block_num + 1) % blocks; - } - - len = sizeof(stats); - err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); - if (err < 0) { - perror("getsockopt"); - exit(1); - } - - fflush(stdout); - printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", - stats.tp_packets, bytes_total, stats.tp_drops, - stats.tp_freeze_q_cnt); - - teardown_socket(&ring, fd); - return 0; -} - -------------------------------------------------------------------------------- -+ PACKET_QDISC_BYPASS -------------------------------------------------------------------------------- - -If there is a requirement to load the network with many packets in a similar -fashion as pktgen does, you might set the following option after socket -creation: - - int one = 1; - setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one)); - -This has the side-effect, that packets sent through PF_PACKET will bypass the -kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning, -packet are not buffered, tc disciplines are ignored, increased loss can occur -and such packets are also not visible to other PF_PACKET sockets anymore. So, -you have been warned; generally, this can be useful for stress testing various -components of a system. - -On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled -on PF_PACKET sockets. - -------------------------------------------------------------------------------- -+ PACKET_TIMESTAMP -------------------------------------------------------------------------------- - -The PACKET_TIMESTAMP setting determines the source of the timestamp in -the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your -NIC is capable of timestamping packets in hardware, you can request those -hardware timestamps to be used. Note: you may need to enable the generation -of hardware timestamps with SIOCSHWTSTAMP (see related information from -Documentation/networking/timestamping.txt). - -PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING: - - int req = SOF_TIMESTAMPING_RAW_HARDWARE; - setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)) - -For the mmap(2)ed ring buffers, such timestamps are stored in the -tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine -what kind of timestamp has been reported, the tp_status field is binary |'ed -with the following possible bits ... - - TP_STATUS_TS_RAW_HARDWARE - TP_STATUS_TS_SOFTWARE - -... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the -RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a -software fallback was invoked *within* PF_PACKET's processing code (less -precise). - -Getting timestamps for the TX_RING works as follows: i) fill the ring frames, -ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant -frames to be updated resp. the frame handed over to the application, iv) walk -through the frames to pick up the individual hw/sw timestamps. - -Only (!) if transmit timestamping is enabled, then these bits are combined -with binary | with TP_STATUS_AVAILABLE, so you must check for that in your -application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) -in a first step to see if the frame belongs to the application, and then -one can extract the type of timestamp in a second step from tp_status)! - -If you don't care about them, thus having it disabled, checking for -TP_STATUS_AVAILABLE resp. TP_STATUS_WRONG_FORMAT is sufficient. If in the -TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec -members do not contain a valid value. For TX_RINGs, by default no timestamp -is generated! - -See include/linux/net_tstamp.h and Documentation/networking/timestamping.txt -for more information on hardware timestamps. - -------------------------------------------------------------------------------- -+ Miscellaneous bits -------------------------------------------------------------------------------- - -- Packet sockets work well together with Linux socket filters, thus you also - might want to have a look at Documentation/networking/filter.rst - --------------------------------------------------------------------------------- -+ THANKS --------------------------------------------------------------------------------- - - Jesse Brandeburg, for fixing my grammathical/spelling errors - -- cgit v1.2.3 From 6e94eaaa400d66f13e25e071926047ef2e3d21e3 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:12 +0200 Subject: docs: networking: convert phonet.txt to ReST MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - add SPDX header; - adjust title markup; - use copyright symbol; - add notes markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Rémi Denis-Courmont Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/packet_mmap.rst | 2 +- Documentation/networking/phonet.rst | 230 +++++++++++++++++++++++++++++++ Documentation/networking/phonet.txt | 214 ---------------------------- MAINTAINERS | 2 +- 5 files changed, 233 insertions(+), 216 deletions(-) create mode 100644 Documentation/networking/phonet.rst delete mode 100644 Documentation/networking/phonet.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 8262b535a83e..e460026331c6 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -90,6 +90,7 @@ Contents: openvswitch operstates packet_mmap + phonet .. only:: subproject and html diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst index 5f213d17652f..884c7222b9e9 100644 --- a/Documentation/networking/packet_mmap.rst +++ b/Documentation/networking/packet_mmap.rst @@ -1076,7 +1076,7 @@ Miscellaneous bits ================== - Packet sockets work well together with Linux socket filters, thus you also - might want to have a look at Documentation/networking/filter.txt + might want to have a look at Documentation/networking/filter.rst THANKS ====== diff --git a/Documentation/networking/phonet.rst b/Documentation/networking/phonet.rst new file mode 100644 index 000000000000..8668dcbc5e6a --- /dev/null +++ b/Documentation/networking/phonet.rst @@ -0,0 +1,230 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +============================ +Linux Phonet protocol family +============================ + +Introduction +------------ + +Phonet is a packet protocol used by Nokia cellular modems for both IPC +and RPC. With the Linux Phonet socket family, Linux host processes can +receive and send messages from/to the modem, or any other external +device attached to the modem. The modem takes care of routing. + +Phonet packets can be exchanged through various hardware connections +depending on the device, such as: + + - USB with the CDC Phonet interface, + - infrared, + - Bluetooth, + - an RS232 serial port (with a dedicated "FBUS" line discipline), + - the SSI bus with some TI OMAP processors. + + +Packets format +-------------- + +Phonet packets have a common header as follows:: + + struct phonethdr { + uint8_t pn_media; /* Media type (link-layer identifier) */ + uint8_t pn_rdev; /* Receiver device ID */ + uint8_t pn_sdev; /* Sender device ID */ + uint8_t pn_res; /* Resource ID or function */ + uint16_t pn_length; /* Big-endian message byte length (minus 6) */ + uint8_t pn_robj; /* Receiver object ID */ + uint8_t pn_sobj; /* Sender object ID */ + }; + +On Linux, the link-layer header includes the pn_media byte (see below). +The next 7 bytes are part of the network-layer header. + +The device ID is split: the 6 higher-order bits constitute the device +address, while the 2 lower-order bits are used for multiplexing, as are +the 8-bit object identifiers. As such, Phonet can be considered as a +network layer with 6 bits of address space and 10 bits for transport +protocol (much like port numbers in IP world). + +The modem always has address number zero. All other device have a their +own 6-bit address. + + +Link layer +---------- + +Phonet links are always point-to-point links. The link layer header +consists of a single Phonet media type byte. It uniquely identifies the +link through which the packet is transmitted, from the modem's +perspective. Each Phonet network device shall prepend and set the media +type byte as appropriate. For convenience, a common phonet_header_ops +link-layer header operations structure is provided. It sets the +media type according to the network device hardware address. + +Linux Phonet network interfaces support a dedicated link layer packets +type (ETH_P_PHONET) which is out of the Ethernet type range. They can +only send and receive Phonet packets. + +The virtual TUN tunnel device driver can also be used for Phonet. This +requires IFF_TUN mode, _without_ the IFF_NO_PI flag. In this case, +there is no link-layer header, so there is no Phonet media type byte. + +Note that Phonet interfaces are not allowed to re-order packets, so +only the (default) Linux FIFO qdisc should be used with them. + + +Network layer +------------- + +The Phonet socket address family maps the Phonet packet header:: + + struct sockaddr_pn { + sa_family_t spn_family; /* AF_PHONET */ + uint8_t spn_obj; /* Object ID */ + uint8_t spn_dev; /* Device ID */ + uint8_t spn_resource; /* Resource or function */ + uint8_t spn_zero[...]; /* Padding */ + }; + +The resource field is only used when sending and receiving; +It is ignored by bind() and getsockname(). + + +Low-level datagram protocol +--------------------------- + +Applications can send Phonet messages using the Phonet datagram socket +protocol from the PF_PHONET family. Each socket is bound to one of the +2^10 object IDs available, and can send and receive packets with any +other peer. + +:: + + struct sockaddr_pn addr = { .spn_family = AF_PHONET, }; + ssize_t len; + socklen_t addrlen = sizeof(addr); + int fd; + + fd = socket(PF_PHONET, SOCK_DGRAM, 0); + bind(fd, (struct sockaddr *)&addr, sizeof(addr)); + /* ... */ + + sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr)); + len = recvfrom(fd, buf, sizeof(buf), 0, + (struct sockaddr *)&addr, &addrlen); + +This protocol follows the SOCK_DGRAM connection-less semantics. +However, connect() and getpeername() are not supported, as they did +not seem useful with Phonet usages (could be added easily). + + +Resource subscription +--------------------- + +A Phonet datagram socket can be subscribed to any number of 8-bits +Phonet resources, as follow:: + + uint32_t res = 0xXX; + ioctl(fd, SIOCPNADDRESOURCE, &res); + +Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O +control request, or when the socket is closed. + +Note that no more than one socket can be subcribed to any given +resource at a time. If not, ioctl() will return EBUSY. + + +Phonet Pipe protocol +-------------------- + +The Phonet Pipe protocol is a simple sequenced packets protocol +with end-to-end congestion control. It uses the passive listening +socket paradigm. The listening socket is bound to an unique free object +ID. Each listening socket can handle up to 255 simultaneous +connections, one per accept()'d socket. + +:: + + int lfd, cfd; + + lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); + listen (lfd, INT_MAX); + + /* ... */ + cfd = accept(lfd, NULL, NULL); + for (;;) + { + char buf[...]; + ssize_t len = read(cfd, buf, sizeof(buf)); + + /* ... */ + + write(cfd, msg, msglen); + } + +Connections are traditionally established between two endpoints by a +"third party" application. This means that both endpoints are passive. + + +As of Linux kernel version 2.6.39, it is also possible to connect +two endpoints directly, using connect() on the active side. This is +intended to support the newer Nokia Wireless Modem API, as found in +e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform:: + + struct sockaddr_spn spn; + int fd; + + fd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); + memset(&spn, 0, sizeof(spn)); + spn.spn_family = AF_PHONET; + spn.spn_obj = ...; + spn.spn_dev = ...; + spn.spn_resource = 0xD9; + connect(fd, (struct sockaddr *)&spn, sizeof(spn)); + /* normal I/O here ... */ + close(fd); + + +.. Warning: + + When polling a connected pipe socket for writability, there is an + intrinsic race condition whereby writability might be lost between the + polling and the writing system calls. In this case, the socket will + block until write becomes possible again, unless non-blocking mode + is enabled. + + +The pipe protocol provides two socket options at the SOL_PNPIPE level: + + PNPIPE_ENCAP accepts one integer value (int) of: + + PNPIPE_ENCAP_NONE: + The socket operates normally (default). + + PNPIPE_ENCAP_IP: + The socket is used as a backend for a virtual IP + interface. This requires CAP_NET_ADMIN capability. GPRS data + support on Nokia modems can use this. Note that the socket cannot + be reliably poll()'d or read() from while in this mode. + + PNPIPE_IFINDEX + is a read-only integer value. It contains the + interface index of the network interface created by PNPIPE_ENCAP, + or zero if encapsulation is off. + + PNPIPE_HANDLE + is a read-only integer value. It contains the underlying + identifier ("pipe handle") of the pipe. This is only defined for + socket descriptors that are already connected or being connected. + + +Authors +------- + +Linux Phonet was initially written by Sakari Ailus. + +Other contributors include Mikä Liljeberg, Andras Domokos, +Carlos Chinea and Rémi Denis-Courmont. + +Copyright |copy| 2008 Nokia Corporation. diff --git a/Documentation/networking/phonet.txt b/Documentation/networking/phonet.txt deleted file mode 100644 index 81003581f47a..000000000000 --- a/Documentation/networking/phonet.txt +++ /dev/null @@ -1,214 +0,0 @@ -Linux Phonet protocol family -============================ - -Introduction ------------- - -Phonet is a packet protocol used by Nokia cellular modems for both IPC -and RPC. With the Linux Phonet socket family, Linux host processes can -receive and send messages from/to the modem, or any other external -device attached to the modem. The modem takes care of routing. - -Phonet packets can be exchanged through various hardware connections -depending on the device, such as: - - USB with the CDC Phonet interface, - - infrared, - - Bluetooth, - - an RS232 serial port (with a dedicated "FBUS" line discipline), - - the SSI bus with some TI OMAP processors. - - -Packets format --------------- - -Phonet packets have a common header as follows: - - struct phonethdr { - uint8_t pn_media; /* Media type (link-layer identifier) */ - uint8_t pn_rdev; /* Receiver device ID */ - uint8_t pn_sdev; /* Sender device ID */ - uint8_t pn_res; /* Resource ID or function */ - uint16_t pn_length; /* Big-endian message byte length (minus 6) */ - uint8_t pn_robj; /* Receiver object ID */ - uint8_t pn_sobj; /* Sender object ID */ - }; - -On Linux, the link-layer header includes the pn_media byte (see below). -The next 7 bytes are part of the network-layer header. - -The device ID is split: the 6 higher-order bits constitute the device -address, while the 2 lower-order bits are used for multiplexing, as are -the 8-bit object identifiers. As such, Phonet can be considered as a -network layer with 6 bits of address space and 10 bits for transport -protocol (much like port numbers in IP world). - -The modem always has address number zero. All other device have a their -own 6-bit address. - - -Link layer ----------- - -Phonet links are always point-to-point links. The link layer header -consists of a single Phonet media type byte. It uniquely identifies the -link through which the packet is transmitted, from the modem's -perspective. Each Phonet network device shall prepend and set the media -type byte as appropriate. For convenience, a common phonet_header_ops -link-layer header operations structure is provided. It sets the -media type according to the network device hardware address. - -Linux Phonet network interfaces support a dedicated link layer packets -type (ETH_P_PHONET) which is out of the Ethernet type range. They can -only send and receive Phonet packets. - -The virtual TUN tunnel device driver can also be used for Phonet. This -requires IFF_TUN mode, _without_ the IFF_NO_PI flag. In this case, -there is no link-layer header, so there is no Phonet media type byte. - -Note that Phonet interfaces are not allowed to re-order packets, so -only the (default) Linux FIFO qdisc should be used with them. - - -Network layer -------------- - -The Phonet socket address family maps the Phonet packet header: - - struct sockaddr_pn { - sa_family_t spn_family; /* AF_PHONET */ - uint8_t spn_obj; /* Object ID */ - uint8_t spn_dev; /* Device ID */ - uint8_t spn_resource; /* Resource or function */ - uint8_t spn_zero[...]; /* Padding */ - }; - -The resource field is only used when sending and receiving; -It is ignored by bind() and getsockname(). - - -Low-level datagram protocol ---------------------------- - -Applications can send Phonet messages using the Phonet datagram socket -protocol from the PF_PHONET family. Each socket is bound to one of the -2^10 object IDs available, and can send and receive packets with any -other peer. - - struct sockaddr_pn addr = { .spn_family = AF_PHONET, }; - ssize_t len; - socklen_t addrlen = sizeof(addr); - int fd; - - fd = socket(PF_PHONET, SOCK_DGRAM, 0); - bind(fd, (struct sockaddr *)&addr, sizeof(addr)); - /* ... */ - - sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr)); - len = recvfrom(fd, buf, sizeof(buf), 0, - (struct sockaddr *)&addr, &addrlen); - -This protocol follows the SOCK_DGRAM connection-less semantics. -However, connect() and getpeername() are not supported, as they did -not seem useful with Phonet usages (could be added easily). - - -Resource subscription ---------------------- - -A Phonet datagram socket can be subscribed to any number of 8-bits -Phonet resources, as follow: - - uint32_t res = 0xXX; - ioctl(fd, SIOCPNADDRESOURCE, &res); - -Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O -control request, or when the socket is closed. - -Note that no more than one socket can be subcribed to any given -resource at a time. If not, ioctl() will return EBUSY. - - -Phonet Pipe protocol --------------------- - -The Phonet Pipe protocol is a simple sequenced packets protocol -with end-to-end congestion control. It uses the passive listening -socket paradigm. The listening socket is bound to an unique free object -ID. Each listening socket can handle up to 255 simultaneous -connections, one per accept()'d socket. - - int lfd, cfd; - - lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); - listen (lfd, INT_MAX); - - /* ... */ - cfd = accept(lfd, NULL, NULL); - for (;;) - { - char buf[...]; - ssize_t len = read(cfd, buf, sizeof(buf)); - - /* ... */ - - write(cfd, msg, msglen); - } - -Connections are traditionally established between two endpoints by a -"third party" application. This means that both endpoints are passive. - - -As of Linux kernel version 2.6.39, it is also possible to connect -two endpoints directly, using connect() on the active side. This is -intended to support the newer Nokia Wireless Modem API, as found in -e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform: - - struct sockaddr_spn spn; - int fd; - - fd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); - memset(&spn, 0, sizeof(spn)); - spn.spn_family = AF_PHONET; - spn.spn_obj = ...; - spn.spn_dev = ...; - spn.spn_resource = 0xD9; - connect(fd, (struct sockaddr *)&spn, sizeof(spn)); - /* normal I/O here ... */ - close(fd); - - -WARNING: -When polling a connected pipe socket for writability, there is an -intrinsic race condition whereby writability might be lost between the -polling and the writing system calls. In this case, the socket will -block until write becomes possible again, unless non-blocking mode -is enabled. - - -The pipe protocol provides two socket options at the SOL_PNPIPE level: - - PNPIPE_ENCAP accepts one integer value (int) of: - - PNPIPE_ENCAP_NONE: The socket operates normally (default). - - PNPIPE_ENCAP_IP: The socket is used as a backend for a virtual IP - interface. This requires CAP_NET_ADMIN capability. GPRS data - support on Nokia modems can use this. Note that the socket cannot - be reliably poll()'d or read() from while in this mode. - - PNPIPE_IFINDEX is a read-only integer value. It contains the - interface index of the network interface created by PNPIPE_ENCAP, - or zero if encapsulation is off. - - PNPIPE_HANDLE is a read-only integer value. It contains the underlying - identifier ("pipe handle") of the pipe. This is only defined for - socket descriptors that are already connected or being connected. - - -Authors -------- - -Linux Phonet was initially written by Sakari Ailus. -Other contributors include Mikä Liljeberg, Andras Domokos, -Carlos Chinea and Rémi Denis-Courmont. -Copyright (C) 2008 Nokia Corporation. diff --git a/MAINTAINERS b/MAINTAINERS index 33bfc9e4aead..785f56e5f210 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13262,7 +13262,7 @@ F: drivers/input/joystick/pxrc.c PHONET PROTOCOL M: Remi Denis-Courmont S: Supported -F: Documentation/networking/phonet.txt +F: Documentation/networking/phonet.rst F: include/linux/phonet.h F: include/net/phonet/ F: include/uapi/linux/phonet.h -- cgit v1.2.3 From c1e4535f24bcfeef55a7ed409a5f50548e284426 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:13 +0200 Subject: docs: networking: convert pktgen.txt to ReST - add SPDX header; - adjust title markup; - use bold markups on a few places; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/pktgen.rst | 412 ++++++++++++++++++++++++++++++++++++ Documentation/networking/pktgen.txt | 400 ---------------------------------- net/Kconfig | 2 +- net/core/pktgen.c | 2 +- samples/pktgen/README.rst | 2 +- 6 files changed, 416 insertions(+), 403 deletions(-) create mode 100644 Documentation/networking/pktgen.rst delete mode 100644 Documentation/networking/pktgen.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e460026331c6..696181a96e3c 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -91,6 +91,7 @@ Contents: operstates packet_mmap phonet + pktgen .. only:: subproject and html diff --git a/Documentation/networking/pktgen.rst b/Documentation/networking/pktgen.rst new file mode 100644 index 000000000000..7afa1c9f1183 --- /dev/null +++ b/Documentation/networking/pktgen.rst @@ -0,0 +1,412 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================== +HOWTO for the linux packet generator +==================================== + +Enable CONFIG_NET_PKTGEN to compile and build pktgen either in-kernel +or as a module. A module is preferred; modprobe pktgen if needed. Once +running, pktgen creates a thread for each CPU with affinity to that CPU. +Monitoring and controlling is done via /proc. It is easiest to select a +suitable sample script and configure that. + +On a dual CPU:: + + ps aux | grep pkt + root 129 0.3 0.0 0 0 ? SW 2003 523:20 [kpktgend_0] + root 130 0.3 0.0 0 0 ? SW 2003 509:50 [kpktgend_1] + + +For monitoring and control pktgen creates:: + + /proc/net/pktgen/pgctrl + /proc/net/pktgen/kpktgend_X + /proc/net/pktgen/ethX + + +Tuning NIC for max performance +============================== + +The default NIC settings are (likely) not tuned for pktgen's artificial +overload type of benchmarking, as this could hurt the normal use-case. + +Specifically increasing the TX ring buffer in the NIC:: + + # ethtool -G ethX tx 1024 + +A larger TX ring can improve pktgen's performance, while it can hurt +in the general case, 1) because the TX ring buffer might get larger +than the CPU's L1/L2 cache, 2) because it allows more queueing in the +NIC HW layer (which is bad for bufferbloat). + +One should hesitate to conclude that packets/descriptors in the HW +TX ring cause delay. Drivers usually delay cleaning up the +ring-buffers for various performance reasons, and packets stalling +the TX ring might just be waiting for cleanup. + +This cleanup issue is specifically the case for the driver ixgbe +(Intel 82599 chip). This driver (ixgbe) combines TX+RX ring cleanups, +and the cleanup interval is affected by the ethtool --coalesce setting +of parameter "rx-usecs". + +For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6):: + + # ethtool -C ethX rx-usecs 30 + + +Kernel threads +============== +Pktgen creates a thread for each CPU with affinity to that CPU. +Which is controlled through procfile /proc/net/pktgen/kpktgend_X. + +Example: /proc/net/pktgen/kpktgend_0:: + + Running: + Stopped: eth4@0 + Result: OK: add_device=eth4@0 + +Most important are the devices assigned to the thread. + +The two basic thread commands are: + + * add_device DEVICE@NAME -- adds a single device + * rem_device_all -- remove all associated devices + +When adding a device to a thread, a corresponding procfile is created +which is used for configuring this device. Thus, device names need to +be unique. + +To support adding the same device to multiple threads, which is useful +with multi queue NICs, the device naming scheme is extended with "@": +device@something + +The part after "@" can be anything, but it is custom to use the thread +number. + +Viewing devices +=============== + +The Params section holds configured information. The Current section +holds running statistics. The Result is printed after a run or after +interruption. Example:: + + /proc/net/pktgen/eth4@0 + + Params: count 100000 min_pkt_size: 60 max_pkt_size: 60 + frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0 + flows: 0 flowlen: 0 + queue_map_min: 0 queue_map_max: 0 + dst_min: 192.168.81.2 dst_max: + src_min: src_max: + src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8 + udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9 + src_mac_count: 0 dst_mac_count: 0 + Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU + Current: + pkts-sofar: 100000 errors: 0 + started: 623913381008us stopped: 623913396439us idle: 25us + seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0 + cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2 + cur_udp_dst: 9 cur_udp_src: 42 + cur_queue_map: 0 + flows: 0 + Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags) + 6480562pps 3110Mb/sec (3110669760bps) errors: 0 + + +Configuring devices +=================== +This is done via the /proc interface, and most easily done via pgset +as defined in the sample scripts. +You need to specify PGDEV environment variable to use functions from sample +scripts, i.e.:: + + export PGDEV=/proc/net/pktgen/eth4@0 + source samples/pktgen/functions.sh + +Examples:: + + pg_ctrl start starts injection. + pg_ctrl stop aborts injection. Also, ^C aborts generator. + + pgset "clone_skb 1" sets the number of copies of the same packet + pgset "clone_skb 0" use single SKB for all transmits + pgset "burst 8" uses xmit_more API to queue 8 copies of the same + packet and update HW tx queue tail pointer once. + "burst 1" is the default + pgset "pkt_size 9014" sets packet size to 9014 + pgset "frags 5" packet will consist of 5 fragments + pgset "count 200000" sets number of packets to send, set to zero + for continuous sends until explicitly stopped. + + pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds + + pgset "dst 10.0.0.1" sets IP destination address + (BEWARE! This generator is very aggressive!) + + pgset "dst_min 10.0.0.1" Same as dst + pgset "dst_max 10.0.0.254" Set the maximum destination IP. + pgset "src_min 10.0.0.1" Set the minimum (or only) source IP. + pgset "src_max 10.0.0.254" Set the maximum source IP. + pgset "dst6 fec0::1" IPV6 destination address + pgset "src6 fec0::2" IPV6 source address + pgset "dstmac 00:00:00:00:00:00" sets MAC destination address + pgset "srcmac 00:00:00:00:00:00" sets MAC source address + + pgset "queue_map_min 0" Sets the min value of tx queue interval + pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices + To select queue 1 of a given device, + use queue_map_min=1 and queue_map_max=1 + + pgset "src_mac_count 1" Sets the number of MACs we'll range through. + The 'minimum' MAC is what you set with srcmac. + + pgset "dst_mac_count 1" Sets the number of MACs we'll range through. + The 'minimum' MAC is what you set with dstmac. + + pgset "flag [name]" Set a flag to determine behaviour. Current flags + are: IPSRC_RND # IP source is random (between min/max) + IPDST_RND # IP destination is random + UDPSRC_RND, UDPDST_RND, + MACSRC_RND, MACDST_RND + TXSIZE_RND, IPV6, + MPLS_RND, VID_RND, SVID_RND + FLOW_SEQ, + QUEUE_MAP_RND # queue map random + QUEUE_MAP_CPU # queue map mirrors smp_processor_id() + UDPCSUM, + IPSEC # IPsec encapsulation (needs CONFIG_XFRM) + NODE_ALLOC # node specific memory allocation + NO_TIMESTAMP # disable timestamping + pgset 'flag ![name]' Clear a flag to determine behaviour. + Note that you might need to use single quote in + interactive mode, so that your shell wouldn't expand + the specified flag as a history command. + + pgset "spi [SPI_VALUE]" Set specific SA used to transform packet. + + pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then + cycle through the port range. + + pgset "udp_src_max 9" set UDP source port max. + pgset "udp_dst_min 9" set UDP destination port min, If < udp_dst_max, then + cycle through the port range. + pgset "udp_dst_max 9" set UDP destination port max. + + pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example + outer label=16,middle label=32, + inner label=0 (IPv4 NULL)) Note that + there must be no spaces between the + arguments. Leading zeros are required. + Do not set the bottom of stack bit, + that's done automatically. If you do + set the bottom of stack bit, that + indicates that you want to randomly + generate that address and the flag + MPLS_RND will be turned on. You + can have any mix of random and fixed + labels in the label stack. + + pgset "mpls 0" turn off mpls (or any invalid argument works too!) + + pgset "vlan_id 77" set VLAN ID 0-4095 + pgset "vlan_p 3" set priority bit 0-7 (default 0) + pgset "vlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "svlan_id 22" set SVLAN ID 0-4095 + pgset "svlan_p 3" set priority bit 0-7 (default 0) + pgset "svlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "vlan_id 9999" > 4095 remove vlan and svlan tags + pgset "svlan 9999" > 4095 remove svlan tag + + + pgset "tos XX" set former IPv4 TOS field (e.g. "tos 28" for AF11 no ECN, default 00) + pgset "traffic_class XX" set former IPv6 TRAFFIC CLASS (e.g. "traffic_class B8" for EF no ECN, default 00) + + pgset "rate 300M" set rate to 300 Mb/s + pgset "ratep 1000000" set rate to 1Mpps + + pgset "xmit_mode netif_receive" RX inject into stack netif_receive_skb() + Works with "burst" but not with "clone_skb". + Default xmit_mode is "start_xmit". + +Sample scripts +============== + +A collection of tutorial scripts and helpers for pktgen is in the +samples/pktgen directory. The helper parameters.sh file support easy +and consistent parameter parsing across the sample scripts. + +Usage example and help:: + + ./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2 + +Usage::: + + ./pktgen_sample01_simple.sh [-vx] -i ethX + + -i : ($DEV) output interface/device (required) + -s : ($PKT_SIZE) packet size + -d : ($DEST_IP) destination IP + -m : ($DST_MAC) destination MAC-addr + -t : ($THREADS) threads to start + -c : ($SKB_CLONE) SKB clones send before alloc new SKB + -b : ($BURST) HW level bursting of SKBs + -v : ($VERBOSE) verbose + -x : ($DEBUG) debug + +The global variables being set are also listed. E.g. the required +interface/device parameter "-i" sets variable $DEV. Copy the +pktgen_sampleXX scripts and modify them to fit your own needs. + +The old scripts:: + + pktgen.conf-1-2 # 1 CPU 2 dev + pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS + pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6 + pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS + pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows. + + +Interrupt affinity +=================== +Note that when adding devices to a specific CPU it is a good idea to +also assign /proc/irq/XX/smp_affinity so that the TX interrupts are bound +to the same CPU. This reduces cache bouncing when freeing skbs. + +Plus using the device flag QUEUE_MAP_CPU, which maps the SKBs TX queue +to the running threads CPU (directly from smp_processor_id()). + +Enable IPsec +============ +Default IPsec transformation with ESP encapsulation plus transport mode +can be enabled by simply setting:: + + pgset "flag IPSEC" + pgset "flows 1" + +To avoid breaking existing testbed scripts for using AH type and tunnel mode, +you can use "pgset spi SPI_VALUE" to specify which transformation mode +to employ. + + +Current commands and configuration options +========================================== + +**Pgcontrol commands**:: + + start + stop + reset + +**Thread commands**:: + + add_device + rem_device_all + + +**Device commands**:: + + count + clone_skb + burst + debug + + frags + delay + + src_mac_count + dst_mac_count + + pkt_size + min_pkt_size + max_pkt_size + + queue_map_min + queue_map_max + skb_priority + + tos (ipv4) + traffic_class (ipv6) + + mpls + + udp_src_min + udp_src_max + + udp_dst_min + udp_dst_max + + node + + flag + IPSRC_RND + IPDST_RND + UDPSRC_RND + UDPDST_RND + MACSRC_RND + MACDST_RND + TXSIZE_RND + IPV6 + MPLS_RND + VID_RND + SVID_RND + FLOW_SEQ + QUEUE_MAP_RND + QUEUE_MAP_CPU + UDPCSUM + IPSEC + NODE_ALLOC + NO_TIMESTAMP + + spi (ipsec) + + dst_min + dst_max + + src_min + src_max + + dst_mac + src_mac + + clear_counters + + src6 + dst6 + dst6_max + dst6_min + + flows + flowlen + + rate + ratep + + xmit_mode + + vlan_cfi + vlan_id + vlan_p + + svlan_cfi + svlan_id + svlan_p + + +References: + +- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/ +- tp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/ + +Paper from Linux-Kongress in Erlangen 2004. +- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf + +Thanks to: + +Grant Grundler for testing on IA-64 and parisc, Harald Welte, Lennert Buytenhek +Stephen Hemminger, Andi Kleen, Dave Miller and many others. + + +Good luck with the linux net-development. diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt deleted file mode 100644 index d2fd78f85aa4..000000000000 --- a/Documentation/networking/pktgen.txt +++ /dev/null @@ -1,400 +0,0 @@ - - - HOWTO for the linux packet generator - ------------------------------------ - -Enable CONFIG_NET_PKTGEN to compile and build pktgen either in-kernel -or as a module. A module is preferred; modprobe pktgen if needed. Once -running, pktgen creates a thread for each CPU with affinity to that CPU. -Monitoring and controlling is done via /proc. It is easiest to select a -suitable sample script and configure that. - -On a dual CPU: - -ps aux | grep pkt -root 129 0.3 0.0 0 0 ? SW 2003 523:20 [kpktgend_0] -root 130 0.3 0.0 0 0 ? SW 2003 509:50 [kpktgend_1] - - -For monitoring and control pktgen creates: - /proc/net/pktgen/pgctrl - /proc/net/pktgen/kpktgend_X - /proc/net/pktgen/ethX - - -Tuning NIC for max performance -============================== - -The default NIC settings are (likely) not tuned for pktgen's artificial -overload type of benchmarking, as this could hurt the normal use-case. - -Specifically increasing the TX ring buffer in the NIC: - # ethtool -G ethX tx 1024 - -A larger TX ring can improve pktgen's performance, while it can hurt -in the general case, 1) because the TX ring buffer might get larger -than the CPU's L1/L2 cache, 2) because it allows more queueing in the -NIC HW layer (which is bad for bufferbloat). - -One should hesitate to conclude that packets/descriptors in the HW -TX ring cause delay. Drivers usually delay cleaning up the -ring-buffers for various performance reasons, and packets stalling -the TX ring might just be waiting for cleanup. - -This cleanup issue is specifically the case for the driver ixgbe -(Intel 82599 chip). This driver (ixgbe) combines TX+RX ring cleanups, -and the cleanup interval is affected by the ethtool --coalesce setting -of parameter "rx-usecs". - -For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6): - # ethtool -C ethX rx-usecs 30 - - -Kernel threads -============== -Pktgen creates a thread for each CPU with affinity to that CPU. -Which is controlled through procfile /proc/net/pktgen/kpktgend_X. - -Example: /proc/net/pktgen/kpktgend_0 - - Running: - Stopped: eth4@0 - Result: OK: add_device=eth4@0 - -Most important are the devices assigned to the thread. - -The two basic thread commands are: - * add_device DEVICE@NAME -- adds a single device - * rem_device_all -- remove all associated devices - -When adding a device to a thread, a corresponding procfile is created -which is used for configuring this device. Thus, device names need to -be unique. - -To support adding the same device to multiple threads, which is useful -with multi queue NICs, the device naming scheme is extended with "@": - device@something - -The part after "@" can be anything, but it is custom to use the thread -number. - -Viewing devices -=============== - -The Params section holds configured information. The Current section -holds running statistics. The Result is printed after a run or after -interruption. Example: - -/proc/net/pktgen/eth4@0 - - Params: count 100000 min_pkt_size: 60 max_pkt_size: 60 - frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0 - flows: 0 flowlen: 0 - queue_map_min: 0 queue_map_max: 0 - dst_min: 192.168.81.2 dst_max: - src_min: src_max: - src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8 - udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9 - src_mac_count: 0 dst_mac_count: 0 - Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU - Current: - pkts-sofar: 100000 errors: 0 - started: 623913381008us stopped: 623913396439us idle: 25us - seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0 - cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2 - cur_udp_dst: 9 cur_udp_src: 42 - cur_queue_map: 0 - flows: 0 - Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags) - 6480562pps 3110Mb/sec (3110669760bps) errors: 0 - - -Configuring devices -=================== -This is done via the /proc interface, and most easily done via pgset -as defined in the sample scripts. -You need to specify PGDEV environment variable to use functions from sample -scripts, i.e.: -export PGDEV=/proc/net/pktgen/eth4@0 -source samples/pktgen/functions.sh - -Examples: - - pg_ctrl start starts injection. - pg_ctrl stop aborts injection. Also, ^C aborts generator. - - pgset "clone_skb 1" sets the number of copies of the same packet - pgset "clone_skb 0" use single SKB for all transmits - pgset "burst 8" uses xmit_more API to queue 8 copies of the same - packet and update HW tx queue tail pointer once. - "burst 1" is the default - pgset "pkt_size 9014" sets packet size to 9014 - pgset "frags 5" packet will consist of 5 fragments - pgset "count 200000" sets number of packets to send, set to zero - for continuous sends until explicitly stopped. - - pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds - - pgset "dst 10.0.0.1" sets IP destination address - (BEWARE! This generator is very aggressive!) - - pgset "dst_min 10.0.0.1" Same as dst - pgset "dst_max 10.0.0.254" Set the maximum destination IP. - pgset "src_min 10.0.0.1" Set the minimum (or only) source IP. - pgset "src_max 10.0.0.254" Set the maximum source IP. - pgset "dst6 fec0::1" IPV6 destination address - pgset "src6 fec0::2" IPV6 source address - pgset "dstmac 00:00:00:00:00:00" sets MAC destination address - pgset "srcmac 00:00:00:00:00:00" sets MAC source address - - pgset "queue_map_min 0" Sets the min value of tx queue interval - pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices - To select queue 1 of a given device, - use queue_map_min=1 and queue_map_max=1 - - pgset "src_mac_count 1" Sets the number of MACs we'll range through. - The 'minimum' MAC is what you set with srcmac. - - pgset "dst_mac_count 1" Sets the number of MACs we'll range through. - The 'minimum' MAC is what you set with dstmac. - - pgset "flag [name]" Set a flag to determine behaviour. Current flags - are: IPSRC_RND # IP source is random (between min/max) - IPDST_RND # IP destination is random - UDPSRC_RND, UDPDST_RND, - MACSRC_RND, MACDST_RND - TXSIZE_RND, IPV6, - MPLS_RND, VID_RND, SVID_RND - FLOW_SEQ, - QUEUE_MAP_RND # queue map random - QUEUE_MAP_CPU # queue map mirrors smp_processor_id() - UDPCSUM, - IPSEC # IPsec encapsulation (needs CONFIG_XFRM) - NODE_ALLOC # node specific memory allocation - NO_TIMESTAMP # disable timestamping - pgset 'flag ![name]' Clear a flag to determine behaviour. - Note that you might need to use single quote in - interactive mode, so that your shell wouldn't expand - the specified flag as a history command. - - pgset "spi [SPI_VALUE]" Set specific SA used to transform packet. - - pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then - cycle through the port range. - - pgset "udp_src_max 9" set UDP source port max. - pgset "udp_dst_min 9" set UDP destination port min, If < udp_dst_max, then - cycle through the port range. - pgset "udp_dst_max 9" set UDP destination port max. - - pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example - outer label=16,middle label=32, - inner label=0 (IPv4 NULL)) Note that - there must be no spaces between the - arguments. Leading zeros are required. - Do not set the bottom of stack bit, - that's done automatically. If you do - set the bottom of stack bit, that - indicates that you want to randomly - generate that address and the flag - MPLS_RND will be turned on. You - can have any mix of random and fixed - labels in the label stack. - - pgset "mpls 0" turn off mpls (or any invalid argument works too!) - - pgset "vlan_id 77" set VLAN ID 0-4095 - pgset "vlan_p 3" set priority bit 0-7 (default 0) - pgset "vlan_cfi 0" set canonical format identifier 0-1 (default 0) - - pgset "svlan_id 22" set SVLAN ID 0-4095 - pgset "svlan_p 3" set priority bit 0-7 (default 0) - pgset "svlan_cfi 0" set canonical format identifier 0-1 (default 0) - - pgset "vlan_id 9999" > 4095 remove vlan and svlan tags - pgset "svlan 9999" > 4095 remove svlan tag - - - pgset "tos XX" set former IPv4 TOS field (e.g. "tos 28" for AF11 no ECN, default 00) - pgset "traffic_class XX" set former IPv6 TRAFFIC CLASS (e.g. "traffic_class B8" for EF no ECN, default 00) - - pgset "rate 300M" set rate to 300 Mb/s - pgset "ratep 1000000" set rate to 1Mpps - - pgset "xmit_mode netif_receive" RX inject into stack netif_receive_skb() - Works with "burst" but not with "clone_skb". - Default xmit_mode is "start_xmit". - -Sample scripts -============== - -A collection of tutorial scripts and helpers for pktgen is in the -samples/pktgen directory. The helper parameters.sh file support easy -and consistent parameter parsing across the sample scripts. - -Usage example and help: - ./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2 - -Usage: ./pktgen_sample01_simple.sh [-vx] -i ethX - -i : ($DEV) output interface/device (required) - -s : ($PKT_SIZE) packet size - -d : ($DEST_IP) destination IP - -m : ($DST_MAC) destination MAC-addr - -t : ($THREADS) threads to start - -c : ($SKB_CLONE) SKB clones send before alloc new SKB - -b : ($BURST) HW level bursting of SKBs - -v : ($VERBOSE) verbose - -x : ($DEBUG) debug - -The global variables being set are also listed. E.g. the required -interface/device parameter "-i" sets variable $DEV. Copy the -pktgen_sampleXX scripts and modify them to fit your own needs. - -The old scripts: - -pktgen.conf-1-2 # 1 CPU 2 dev -pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS -pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6 -pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS -pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows. - - -Interrupt affinity -=================== -Note that when adding devices to a specific CPU it is a good idea to -also assign /proc/irq/XX/smp_affinity so that the TX interrupts are bound -to the same CPU. This reduces cache bouncing when freeing skbs. - -Plus using the device flag QUEUE_MAP_CPU, which maps the SKBs TX queue -to the running threads CPU (directly from smp_processor_id()). - -Enable IPsec -============ -Default IPsec transformation with ESP encapsulation plus transport mode -can be enabled by simply setting: - -pgset "flag IPSEC" -pgset "flows 1" - -To avoid breaking existing testbed scripts for using AH type and tunnel mode, -you can use "pgset spi SPI_VALUE" to specify which transformation mode -to employ. - - -Current commands and configuration options -========================================== - -** Pgcontrol commands: - -start -stop -reset - -** Thread commands: - -add_device -rem_device_all - - -** Device commands: - -count -clone_skb -burst -debug - -frags -delay - -src_mac_count -dst_mac_count - -pkt_size -min_pkt_size -max_pkt_size - -queue_map_min -queue_map_max -skb_priority - -tos (ipv4) -traffic_class (ipv6) - -mpls - -udp_src_min -udp_src_max - -udp_dst_min -udp_dst_max - -node - -flag - IPSRC_RND - IPDST_RND - UDPSRC_RND - UDPDST_RND - MACSRC_RND - MACDST_RND - TXSIZE_RND - IPV6 - MPLS_RND - VID_RND - SVID_RND - FLOW_SEQ - QUEUE_MAP_RND - QUEUE_MAP_CPU - UDPCSUM - IPSEC - NODE_ALLOC - NO_TIMESTAMP - -spi (ipsec) - -dst_min -dst_max - -src_min -src_max - -dst_mac -src_mac - -clear_counters - -src6 -dst6 -dst6_max -dst6_min - -flows -flowlen - -rate -ratep - -xmit_mode - -vlan_cfi -vlan_id -vlan_p - -svlan_cfi -svlan_id -svlan_p - - -References: -ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/ -ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/ - -Paper from Linux-Kongress in Erlangen 2004. -ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf - -Thanks to: -Grant Grundler for testing on IA-64 and parisc, Harald Welte, Lennert Buytenhek -Stephen Hemminger, Andi Kleen, Dave Miller and many others. - - -Good luck with the linux net-development. diff --git a/net/Kconfig b/net/Kconfig index 8b1f85820a6b..c5ba2d180c43 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -344,7 +344,7 @@ config NET_PKTGEN what was just said, you don't need it: say N. Documentation on how to use the packet generator can be found - at . + at . To compile this code as a module, choose M here: the module will be called pktgen. diff --git a/net/core/pktgen.c b/net/core/pktgen.c index 08e2811b5274..b53b6d38c4df 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -56,7 +56,7 @@ * Integrated to 2.5.x 021029 --Lucio Maciel (luciomaciel@zipmail.com.br) * * 021124 Finished major redesign and rewrite for new functionality. - * See Documentation/networking/pktgen.txt for how to use this. + * See Documentation/networking/pktgen.rst for how to use this. * * The new operation: * For each CPU one thread/process is created at start. This process checks diff --git a/samples/pktgen/README.rst b/samples/pktgen/README.rst index 3f6483e8b2df..f9c53ca5cf93 100644 --- a/samples/pktgen/README.rst +++ b/samples/pktgen/README.rst @@ -3,7 +3,7 @@ Sample and benchmark scripts for pktgen (packet generator) This directory contains some pktgen sample and benchmark scripts, that can easily be copied and adjusted for your own use-case. -General doc is located in kernel: Documentation/networking/pktgen.txt +General doc is located in kernel: Documentation/networking/pktgen.rst Helper include files ==================== -- cgit v1.2.3 From 32c01266c0aab060550f592b3fe59405be8ab022 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:14 +0200 Subject: docs: networking: convert PLIP.txt to ReST - add SPDX header; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/PLIP.txt | 215 ----------------------------------- Documentation/networking/index.rst | 1 + Documentation/networking/plip.rst | 222 +++++++++++++++++++++++++++++++++++++ drivers/net/plip/Kconfig | 2 +- 4 files changed, 224 insertions(+), 216 deletions(-) delete mode 100644 Documentation/networking/PLIP.txt create mode 100644 Documentation/networking/plip.rst (limited to 'Documentation') diff --git a/Documentation/networking/PLIP.txt b/Documentation/networking/PLIP.txt deleted file mode 100644 index ad7e3f7c3bbf..000000000000 --- a/Documentation/networking/PLIP.txt +++ /dev/null @@ -1,215 +0,0 @@ -PLIP: The Parallel Line Internet Protocol Device - -Donald Becker (becker@super.org) -I.D.A. Supercomputing Research Center, Bowie MD 20715 - -At some point T. Thorn will probably contribute text, -Tommy Thorn (tthorn@daimi.aau.dk) - -PLIP Introduction ------------------ - -This document describes the parallel port packet pusher for Net/LGX. -This device interface allows a point-to-point connection between two -parallel ports to appear as a IP network interface. - -What is PLIP? -============= - -PLIP is Parallel Line IP, that is, the transportation of IP packages -over a parallel port. In the case of a PC, the obvious choice is the -printer port. PLIP is a non-standard, but [can use] uses the standard -LapLink null-printer cable [can also work in turbo mode, with a PLIP -cable]. [The protocol used to pack IP packages, is a simple one -initiated by Crynwr.] - -Advantages of PLIP -================== - -It's cheap, it's available everywhere, and it's easy. - -The PLIP cable is all that's needed to connect two Linux boxes, and it -can be built for very few bucks. - -Connecting two Linux boxes takes only a second's decision and a few -minutes' work, no need to search for a [supported] netcard. This might -even be especially important in the case of notebooks, where netcards -are not easily available. - -Not requiring a netcard also means that apart from connecting the -cables, everything else is software configuration [which in principle -could be made very easy.] - -Disadvantages of PLIP -===================== - -Doesn't work over a modem, like SLIP and PPP. Limited range, 15 m. -Can only be used to connect three (?) Linux boxes. Doesn't connect to -an existing Ethernet. Isn't standard (not even de facto standard, like -SLIP). - -Performance -=========== - -PLIP easily outperforms Ethernet cards....(ups, I was dreaming, but -it *is* getting late. EOB) - -PLIP driver details -------------------- - -The Linux PLIP driver is an implementation of the original Crynwr protocol, -that uses the parallel port subsystem of the kernel in order to properly -share parallel ports between PLIP and other services. - -IRQs and trigger timeouts -========================= - -When a parallel port used for a PLIP driver has an IRQ configured to it, the -PLIP driver is signaled whenever data is sent to it via the cable, such that -when no data is available, the driver isn't being used. - -However, on some machines it is hard, if not impossible, to configure an IRQ -to a certain parallel port, mainly because it is used by some other device. -On these machines, the PLIP driver can be used in IRQ-less mode, where -the PLIP driver would constantly poll the parallel port for data waiting, -and if such data is available, process it. This mode is less efficient than -the IRQ mode, because the driver has to check the parallel port many times -per second, even when no data at all is sent. Some rough measurements -indicate that there isn't a noticeable performance drop when using IRQ-less -mode as compared to IRQ mode as far as the data transfer speed is involved. -There is a performance drop on the machine hosting the driver. - -When the PLIP driver is used in IRQ mode, the timeout used for triggering a -data transfer (the maximal time the PLIP driver would allow the other side -before announcing a timeout, when trying to handshake a transfer of some -data) is, by default, 500usec. As IRQ delivery is more or less immediate, -this timeout is quite sufficient. - -When in IRQ-less mode, the PLIP driver polls the parallel port HZ times -per second (where HZ is typically 100 on most platforms, and 1024 on an -Alpha, as of this writing). Between two such polls, there are 10^6/HZ usecs. -On an i386, for example, 10^6/100 = 10000usec. It is easy to see that it is -quite possible for the trigger timeout to expire between two such polls, as -the timeout is only 500usec long. As a result, it is required to change the -trigger timeout on the *other* side of a PLIP connection, to about -10^6/HZ usecs. If both sides of a PLIP connection are used in IRQ-less mode, -this timeout is required on both sides. - -It appears that in practice, the trigger timeout can be shorter than in the -above calculation. It isn't an important issue, unless the wire is faulty, -in which case a long timeout would stall the machine when, for whatever -reason, bits are dropped. - -A utility that can perform this change in Linux is plipconfig, which is part -of the net-tools package (its location can be found in the -Documentation/Changes file). An example command would be -'plipconfig plipX trigger 10000', where plipX is the appropriate -PLIP device. - -PLIP hardware interconnection ------------------------------ - -PLIP uses several different data transfer methods. The first (and the -only one implemented in the early version of the code) uses a standard -printer "null" cable to transfer data four bits at a time using -data bit outputs connected to status bit inputs. - -The second data transfer method relies on both machines having -bi-directional parallel ports, rather than output-only ``printer'' -ports. This allows byte-wide transfers and avoids reconstructing -nibbles into bytes, leading to much faster transfers. - -Parallel Transfer Mode 0 Cable -============================== - -The cable for the first transfer mode is a standard -printer "null" cable which transfers data four bits at a time using -data bit outputs of the first port (machine T) connected to the -status bit inputs of the second port (machine R). There are five -status inputs, and they are used as four data inputs and a clock (data -strobe) input, arranged so that the data input bits appear as contiguous -bits with standard status register implementation. - -A cable that implements this protocol is available commercially as a -"Null Printer" or "Turbo Laplink" cable. It can be constructed with -two DB-25 male connectors symmetrically connected as follows: - - STROBE output 1* - D0->ERROR 2 - 15 15 - 2 - D1->SLCT 3 - 13 13 - 3 - D2->PAPOUT 4 - 12 12 - 4 - D3->ACK 5 - 10 10 - 5 - D4->BUSY 6 - 11 11 - 6 - D5,D6,D7 are 7*, 8*, 9* - AUTOFD output 14* - INIT output 16* - SLCTIN 17 - 17 - extra grounds are 18*,19*,20*,21*,22*,23*,24* - GROUND 25 - 25 -* Do not connect these pins on either end - -If the cable you are using has a metallic shield it should be -connected to the metallic DB-25 shell at one end only. - -Parallel Transfer Mode 1 -======================== - -The second data transfer method relies on both machines having -bi-directional parallel ports, rather than output-only ``printer'' -ports. This allows byte-wide transfers, and avoids reconstructing -nibbles into bytes. This cable should not be used on unidirectional -``printer'' (as opposed to ``parallel'') ports or when the machine -isn't configured for PLIP, as it will result in output driver -conflicts and the (unlikely) possibility of damage. - -The cable for this transfer mode should be constructed as follows: - - STROBE->BUSY 1 - 11 - D0->D0 2 - 2 - D1->D1 3 - 3 - D2->D2 4 - 4 - D3->D3 5 - 5 - D4->D4 6 - 6 - D5->D5 7 - 7 - D6->D6 8 - 8 - D7->D7 9 - 9 - INIT -> ACK 16 - 10 - AUTOFD->PAPOUT 14 - 12 - SLCT->SLCTIN 13 - 17 - GND->ERROR 18 - 15 - extra grounds are 19*,20*,21*,22*,23*,24* - GROUND 25 - 25 -* Do not connect these pins on either end - -Once again, if the cable you are using has a metallic shield it should -be connected to the metallic DB-25 shell at one end only. - -PLIP Mode 0 transfer protocol -============================= - -The PLIP driver is compatible with the "Crynwr" parallel port transfer -standard in Mode 0. That standard specifies the following protocol: - - send header nibble '0x8' - count-low octet - count-high octet - ... data octets - checksum octet - -Each octet is sent as - - >4)&0x0F)> - -To start a transfer the transmitting machine outputs a nibble 0x08. -That raises the ACK line, triggering an interrupt in the receiving -machine. The receiving machine disables interrupts and raises its own ACK -line. - -Restated: - -(OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise) -Send_Byte: - OUT := low nibble, OUT.4 := 1 - WAIT FOR IN.4 = 1 - OUT := high nibble, OUT.4 := 0 - WAIT FOR IN.4 = 0 diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 696181a96e3c..18bb10239cad 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -92,6 +92,7 @@ Contents: packet_mmap phonet pktgen + plip .. only:: subproject and html diff --git a/Documentation/networking/plip.rst b/Documentation/networking/plip.rst new file mode 100644 index 000000000000..0eda745050ff --- /dev/null +++ b/Documentation/networking/plip.rst @@ -0,0 +1,222 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================ +PLIP: The Parallel Line Internet Protocol Device +================================================ + +Donald Becker (becker@super.org) +I.D.A. Supercomputing Research Center, Bowie MD 20715 + +At some point T. Thorn will probably contribute text, +Tommy Thorn (tthorn@daimi.aau.dk) + +PLIP Introduction +----------------- + +This document describes the parallel port packet pusher for Net/LGX. +This device interface allows a point-to-point connection between two +parallel ports to appear as a IP network interface. + +What is PLIP? +============= + +PLIP is Parallel Line IP, that is, the transportation of IP packages +over a parallel port. In the case of a PC, the obvious choice is the +printer port. PLIP is a non-standard, but [can use] uses the standard +LapLink null-printer cable [can also work in turbo mode, with a PLIP +cable]. [The protocol used to pack IP packages, is a simple one +initiated by Crynwr.] + +Advantages of PLIP +================== + +It's cheap, it's available everywhere, and it's easy. + +The PLIP cable is all that's needed to connect two Linux boxes, and it +can be built for very few bucks. + +Connecting two Linux boxes takes only a second's decision and a few +minutes' work, no need to search for a [supported] netcard. This might +even be especially important in the case of notebooks, where netcards +are not easily available. + +Not requiring a netcard also means that apart from connecting the +cables, everything else is software configuration [which in principle +could be made very easy.] + +Disadvantages of PLIP +===================== + +Doesn't work over a modem, like SLIP and PPP. Limited range, 15 m. +Can only be used to connect three (?) Linux boxes. Doesn't connect to +an existing Ethernet. Isn't standard (not even de facto standard, like +SLIP). + +Performance +=========== + +PLIP easily outperforms Ethernet cards....(ups, I was dreaming, but +it *is* getting late. EOB) + +PLIP driver details +------------------- + +The Linux PLIP driver is an implementation of the original Crynwr protocol, +that uses the parallel port subsystem of the kernel in order to properly +share parallel ports between PLIP and other services. + +IRQs and trigger timeouts +========================= + +When a parallel port used for a PLIP driver has an IRQ configured to it, the +PLIP driver is signaled whenever data is sent to it via the cable, such that +when no data is available, the driver isn't being used. + +However, on some machines it is hard, if not impossible, to configure an IRQ +to a certain parallel port, mainly because it is used by some other device. +On these machines, the PLIP driver can be used in IRQ-less mode, where +the PLIP driver would constantly poll the parallel port for data waiting, +and if such data is available, process it. This mode is less efficient than +the IRQ mode, because the driver has to check the parallel port many times +per second, even when no data at all is sent. Some rough measurements +indicate that there isn't a noticeable performance drop when using IRQ-less +mode as compared to IRQ mode as far as the data transfer speed is involved. +There is a performance drop on the machine hosting the driver. + +When the PLIP driver is used in IRQ mode, the timeout used for triggering a +data transfer (the maximal time the PLIP driver would allow the other side +before announcing a timeout, when trying to handshake a transfer of some +data) is, by default, 500usec. As IRQ delivery is more or less immediate, +this timeout is quite sufficient. + +When in IRQ-less mode, the PLIP driver polls the parallel port HZ times +per second (where HZ is typically 100 on most platforms, and 1024 on an +Alpha, as of this writing). Between two such polls, there are 10^6/HZ usecs. +On an i386, for example, 10^6/100 = 10000usec. It is easy to see that it is +quite possible for the trigger timeout to expire between two such polls, as +the timeout is only 500usec long. As a result, it is required to change the +trigger timeout on the *other* side of a PLIP connection, to about +10^6/HZ usecs. If both sides of a PLIP connection are used in IRQ-less mode, +this timeout is required on both sides. + +It appears that in practice, the trigger timeout can be shorter than in the +above calculation. It isn't an important issue, unless the wire is faulty, +in which case a long timeout would stall the machine when, for whatever +reason, bits are dropped. + +A utility that can perform this change in Linux is plipconfig, which is part +of the net-tools package (its location can be found in the +Documentation/Changes file). An example command would be +'plipconfig plipX trigger 10000', where plipX is the appropriate +PLIP device. + +PLIP hardware interconnection +----------------------------- + +PLIP uses several different data transfer methods. The first (and the +only one implemented in the early version of the code) uses a standard +printer "null" cable to transfer data four bits at a time using +data bit outputs connected to status bit inputs. + +The second data transfer method relies on both machines having +bi-directional parallel ports, rather than output-only ``printer`` +ports. This allows byte-wide transfers and avoids reconstructing +nibbles into bytes, leading to much faster transfers. + +Parallel Transfer Mode 0 Cable +============================== + +The cable for the first transfer mode is a standard +printer "null" cable which transfers data four bits at a time using +data bit outputs of the first port (machine T) connected to the +status bit inputs of the second port (machine R). There are five +status inputs, and they are used as four data inputs and a clock (data +strobe) input, arranged so that the data input bits appear as contiguous +bits with standard status register implementation. + +A cable that implements this protocol is available commercially as a +"Null Printer" or "Turbo Laplink" cable. It can be constructed with +two DB-25 male connectors symmetrically connected as follows:: + + STROBE output 1* + D0->ERROR 2 - 15 15 - 2 + D1->SLCT 3 - 13 13 - 3 + D2->PAPOUT 4 - 12 12 - 4 + D3->ACK 5 - 10 10 - 5 + D4->BUSY 6 - 11 11 - 6 + D5,D6,D7 are 7*, 8*, 9* + AUTOFD output 14* + INIT output 16* + SLCTIN 17 - 17 + extra grounds are 18*,19*,20*,21*,22*,23*,24* + GROUND 25 - 25 + + * Do not connect these pins on either end + +If the cable you are using has a metallic shield it should be +connected to the metallic DB-25 shell at one end only. + +Parallel Transfer Mode 1 +======================== + +The second data transfer method relies on both machines having +bi-directional parallel ports, rather than output-only ``printer`` +ports. This allows byte-wide transfers, and avoids reconstructing +nibbles into bytes. This cable should not be used on unidirectional +``printer`` (as opposed to ``parallel``) ports or when the machine +isn't configured for PLIP, as it will result in output driver +conflicts and the (unlikely) possibility of damage. + +The cable for this transfer mode should be constructed as follows:: + + STROBE->BUSY 1 - 11 + D0->D0 2 - 2 + D1->D1 3 - 3 + D2->D2 4 - 4 + D3->D3 5 - 5 + D4->D4 6 - 6 + D5->D5 7 - 7 + D6->D6 8 - 8 + D7->D7 9 - 9 + INIT -> ACK 16 - 10 + AUTOFD->PAPOUT 14 - 12 + SLCT->SLCTIN 13 - 17 + GND->ERROR 18 - 15 + extra grounds are 19*,20*,21*,22*,23*,24* + GROUND 25 - 25 + + * Do not connect these pins on either end + +Once again, if the cable you are using has a metallic shield it should +be connected to the metallic DB-25 shell at one end only. + +PLIP Mode 0 transfer protocol +============================= + +The PLIP driver is compatible with the "Crynwr" parallel port transfer +standard in Mode 0. That standard specifies the following protocol:: + + send header nibble '0x8' + count-low octet + count-high octet + ... data octets + checksum octet + +Each octet is sent as:: + + + >4)&0x0F)> + +To start a transfer the transmitting machine outputs a nibble 0x08. +That raises the ACK line, triggering an interrupt in the receiving +machine. The receiving machine disables interrupts and raises its own ACK +line. + +Restated:: + + (OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise) + Send_Byte: + OUT := low nibble, OUT.4 := 1 + WAIT FOR IN.4 = 1 + OUT := high nibble, OUT.4 := 0 + WAIT FOR IN.4 = 0 diff --git a/drivers/net/plip/Kconfig b/drivers/net/plip/Kconfig index b41035be2d51..e03556d1d0c2 100644 --- a/drivers/net/plip/Kconfig +++ b/drivers/net/plip/Kconfig @@ -21,7 +21,7 @@ config PLIP bits at a time (mode 0) or with special PLIP cables, to be used on bidirectional parallel ports only, which can transmit 8 bits at a time (mode 1); you can find the wiring of these cables in - . The cables can be up to + . The cables can be up to 15m long. Mode 0 works also if one of the machines runs DOS/Windows and has some PLIP software installed, e.g. the Crynwr PLIP packet driver () -- cgit v1.2.3 From 71120802ebeda1e645baf673b958978c4000a695 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:15 +0200 Subject: docs: networking: convert ppp_generic.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ppp_generic.rst | 440 +++++++++++++++++++++++++++++++ Documentation/networking/ppp_generic.txt | 428 ------------------------------ 3 files changed, 441 insertions(+), 428 deletions(-) create mode 100644 Documentation/networking/ppp_generic.rst delete mode 100644 Documentation/networking/ppp_generic.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 18bb10239cad..f89535871481 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -93,6 +93,7 @@ Contents: phonet pktgen plip + ppp_generic .. only:: subproject and html diff --git a/Documentation/networking/ppp_generic.rst b/Documentation/networking/ppp_generic.rst new file mode 100644 index 000000000000..e60504377900 --- /dev/null +++ b/Documentation/networking/ppp_generic.rst @@ -0,0 +1,440 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================== +PPP Generic Driver and Channel Interface +======================================== + + Paul Mackerras + paulus@samba.org + + 7 Feb 2002 + +The generic PPP driver in linux-2.4 provides an implementation of the +functionality which is of use in any PPP implementation, including: + +* the network interface unit (ppp0 etc.) +* the interface to the networking code +* PPP multilink: splitting datagrams between multiple links, and + ordering and combining received fragments +* the interface to pppd, via a /dev/ppp character device +* packet compression and decompression +* TCP/IP header compression and decompression +* detecting network traffic for demand dialling and for idle timeouts +* simple packet filtering + +For sending and receiving PPP frames, the generic PPP driver calls on +the services of PPP ``channels``. A PPP channel encapsulates a +mechanism for transporting PPP frames from one machine to another. A +PPP channel implementation can be arbitrarily complex internally but +has a very simple interface with the generic PPP code: it merely has +to be able to send PPP frames, receive PPP frames, and optionally +handle ioctl requests. Currently there are PPP channel +implementations for asynchronous serial ports, synchronous serial +ports, and for PPP over ethernet. + +This architecture makes it possible to implement PPP multilink in a +natural and straightforward way, by allowing more than one channel to +be linked to each ppp network interface unit. The generic layer is +responsible for splitting datagrams on transmit and recombining them +on receive. + + +PPP channel API +--------------- + +See include/linux/ppp_channel.h for the declaration of the types and +functions used to communicate between the generic PPP layer and PPP +channels. + +Each channel has to provide two functions to the generic PPP layer, +via the ppp_channel.ops pointer: + +* start_xmit() is called by the generic layer when it has a frame to + send. The channel has the option of rejecting the frame for + flow-control reasons. In this case, start_xmit() should return 0 + and the channel should call the ppp_output_wakeup() function at a + later time when it can accept frames again, and the generic layer + will then attempt to retransmit the rejected frame(s). If the frame + is accepted, the start_xmit() function should return 1. + +* ioctl() provides an interface which can be used by a user-space + program to control aspects of the channel's behaviour. This + procedure will be called when a user-space program does an ioctl + system call on an instance of /dev/ppp which is bound to the + channel. (Usually it would only be pppd which would do this.) + +The generic PPP layer provides seven functions to channels: + +* ppp_register_channel() is called when a channel has been created, to + notify the PPP generic layer of its presence. For example, setting + a serial port to the PPPDISC line discipline causes the ppp_async + channel code to call this function. + +* ppp_unregister_channel() is called when a channel is to be + destroyed. For example, the ppp_async channel code calls this when + a hangup is detected on the serial port. + +* ppp_output_wakeup() is called by a channel when it has previously + rejected a call to its start_xmit function, and can now accept more + packets. + +* ppp_input() is called by a channel when it has received a complete + PPP frame. + +* ppp_input_error() is called by a channel when it has detected that a + frame has been lost or dropped (for example, because of a FCS (frame + check sequence) error). + +* ppp_channel_index() returns the channel index assigned by the PPP + generic layer to this channel. The channel should provide some way + (e.g. an ioctl) to transmit this back to user-space, as user-space + will need it to attach an instance of /dev/ppp to this channel. + +* ppp_unit_number() returns the unit number of the ppp network + interface to which this channel is connected, or -1 if the channel + is not connected. + +Connecting a channel to the ppp generic layer is initiated from the +channel code, rather than from the generic layer. The channel is +expected to have some way for a user-level process to control it +independently of the ppp generic layer. For example, with the +ppp_async channel, this is provided by the file descriptor to the +serial port. + +Generally a user-level process will initialize the underlying +communications medium and prepare it to do PPP. For example, with an +async tty, this can involve setting the tty speed and modes, issuing +modem commands, and then going through some sort of dialog with the +remote system to invoke PPP service there. We refer to this process +as ``discovery``. Then the user-level process tells the medium to +become a PPP channel and register itself with the generic PPP layer. +The channel then has to report the channel number assigned to it back +to the user-level process. From that point, the PPP negotiation code +in the PPP daemon (pppd) can take over and perform the PPP +negotiation, accessing the channel through the /dev/ppp interface. + +At the interface to the PPP generic layer, PPP frames are stored in +skbuff structures and start with the two-byte PPP protocol number. +The frame does *not* include the 0xff ``address`` byte or the 0x03 +``control`` byte that are optionally used in async PPP. Nor is there +any escaping of control characters, nor are there any FCS or framing +characters included. That is all the responsibility of the channel +code, if it is needed for the particular medium. That is, the skbuffs +presented to the start_xmit() function contain only the 2-byte +protocol number and the data, and the skbuffs presented to ppp_input() +must be in the same format. + +The channel must provide an instance of a ppp_channel struct to +represent the channel. The channel is free to use the ``private`` field +however it wishes. The channel should initialize the ``mtu`` and +``hdrlen`` fields before calling ppp_register_channel() and not change +them until after ppp_unregister_channel() returns. The ``mtu`` field +represents the maximum size of the data part of the PPP frames, that +is, it does not include the 2-byte protocol number. + +If the channel needs some headroom in the skbuffs presented to it for +transmission (i.e., some space free in the skbuff data area before the +start of the PPP frame), it should set the ``hdrlen`` field of the +ppp_channel struct to the amount of headroom required. The generic +PPP layer will attempt to provide that much headroom but the channel +should still check if there is sufficient headroom and copy the skbuff +if there isn't. + +On the input side, channels should ideally provide at least 2 bytes of +headroom in the skbuffs presented to ppp_input(). The generic PPP +code does not require this but will be more efficient if this is done. + + +Buffering and flow control +-------------------------- + +The generic PPP layer has been designed to minimize the amount of data +that it buffers in the transmit direction. It maintains a queue of +transmit packets for the PPP unit (network interface device) plus a +queue of transmit packets for each attached channel. Normally the +transmit queue for the unit will contain at most one packet; the +exceptions are when pppd sends packets by writing to /dev/ppp, and +when the core networking code calls the generic layer's start_xmit() +function with the queue stopped, i.e. when the generic layer has +called netif_stop_queue(), which only happens on a transmit timeout. +The start_xmit function always accepts and queues the packet which it +is asked to transmit. + +Transmit packets are dequeued from the PPP unit transmit queue and +then subjected to TCP/IP header compression and packet compression +(Deflate or BSD-Compress compression), as appropriate. After this +point the packets can no longer be reordered, as the decompression +algorithms rely on receiving compressed packets in the same order that +they were generated. + +If multilink is not in use, this packet is then passed to the attached +channel's start_xmit() function. If the channel refuses to take +the packet, the generic layer saves it for later transmission. The +generic layer will call the channel's start_xmit() function again +when the channel calls ppp_output_wakeup() or when the core +networking code calls the generic layer's start_xmit() function +again. The generic layer contains no timeout and retransmission +logic; it relies on the core networking code for that. + +If multilink is in use, the generic layer divides the packet into one +or more fragments and puts a multilink header on each fragment. It +decides how many fragments to use based on the length of the packet +and the number of channels which are potentially able to accept a +fragment at the moment. A channel is potentially able to accept a +fragment if it doesn't have any fragments currently queued up for it +to transmit. The channel may still refuse a fragment; in this case +the fragment is queued up for the channel to transmit later. This +scheme has the effect that more fragments are given to higher- +bandwidth channels. It also means that under light load, the generic +layer will tend to fragment large packets across all the channels, +thus reducing latency, while under heavy load, packets will tend to be +transmitted as single fragments, thus reducing the overhead of +fragmentation. + + +SMP safety +---------- + +The PPP generic layer has been designed to be SMP-safe. Locks are +used around accesses to the internal data structures where necessary +to ensure their integrity. As part of this, the generic layer +requires that the channels adhere to certain requirements and in turn +provides certain guarantees to the channels. Essentially the channels +are required to provide the appropriate locking on the ppp_channel +structures that form the basis of the communication between the +channel and the generic layer. This is because the channel provides +the storage for the ppp_channel structure, and so the channel is +required to provide the guarantee that this storage exists and is +valid at the appropriate times. + +The generic layer requires these guarantees from the channel: + +* The ppp_channel object must exist from the time that + ppp_register_channel() is called until after the call to + ppp_unregister_channel() returns. + +* No thread may be in a call to any of ppp_input(), ppp_input_error(), + ppp_output_wakeup(), ppp_channel_index() or ppp_unit_number() for a + channel at the time that ppp_unregister_channel() is called for that + channel. + +* ppp_register_channel() and ppp_unregister_channel() must be called + from process context, not interrupt or softirq/BH context. + +* The remaining generic layer functions may be called at softirq/BH + level but must not be called from a hardware interrupt handler. + +* The generic layer may call the channel start_xmit() function at + softirq/BH level but will not call it at interrupt level. Thus the + start_xmit() function may not block. + +* The generic layer will only call the channel ioctl() function in + process context. + +The generic layer provides these guarantees to the channels: + +* The generic layer will not call the start_xmit() function for a + channel while any thread is already executing in that function for + that channel. + +* The generic layer will not call the ioctl() function for a channel + while any thread is already executing in that function for that + channel. + +* By the time a call to ppp_unregister_channel() returns, no thread + will be executing in a call from the generic layer to that channel's + start_xmit() or ioctl() function, and the generic layer will not + call either of those functions subsequently. + + +Interface to pppd +----------------- + +The PPP generic layer exports a character device interface called +/dev/ppp. This is used by pppd to control PPP interface units and +channels. Although there is only one /dev/ppp, each open instance of +/dev/ppp acts independently and can be attached either to a PPP unit +or a PPP channel. This is achieved using the file->private_data field +to point to a separate object for each open instance of /dev/ppp. In +this way an effect similar to Solaris' clone open is obtained, +allowing us to control an arbitrary number of PPP interfaces and +channels without having to fill up /dev with hundreds of device names. + +When /dev/ppp is opened, a new instance is created which is initially +unattached. Using an ioctl call, it can then be attached to an +existing unit, attached to a newly-created unit, or attached to an +existing channel. An instance attached to a unit can be used to send +and receive PPP control frames, using the read() and write() system +calls, along with poll() if necessary. Similarly, an instance +attached to a channel can be used to send and receive PPP frames on +that channel. + +In multilink terms, the unit represents the bundle, while the channels +represent the individual physical links. Thus, a PPP frame sent by a +write to the unit (i.e., to an instance of /dev/ppp attached to the +unit) will be subject to bundle-level compression and to fragmentation +across the individual links (if multilink is in use). In contrast, a +PPP frame sent by a write to the channel will be sent as-is on that +channel, without any multilink header. + +A channel is not initially attached to any unit. In this state it can +be used for PPP negotiation but not for the transfer of data packets. +It can then be connected to a PPP unit with an ioctl call, which +makes it available to send and receive data packets for that unit. + +The ioctl calls which are available on an instance of /dev/ppp depend +on whether it is unattached, attached to a PPP interface, or attached +to a PPP channel. The ioctl calls which are available on an +unattached instance are: + +* PPPIOCNEWUNIT creates a new PPP interface and makes this /dev/ppp + instance the "owner" of the interface. The argument should point to + an int which is the desired unit number if >= 0, or -1 to assign the + lowest unused unit number. Being the owner of the interface means + that the interface will be shut down if this instance of /dev/ppp is + closed. + +* PPPIOCATTACH attaches this instance to an existing PPP interface. + The argument should point to an int containing the unit number. + This does not make this instance the owner of the PPP interface. + +* PPPIOCATTCHAN attaches this instance to an existing PPP channel. + The argument should point to an int containing the channel number. + +The ioctl calls available on an instance of /dev/ppp attached to a +channel are: + +* PPPIOCCONNECT connects this channel to a PPP interface. The + argument should point to an int containing the interface unit + number. It will return an EINVAL error if the channel is already + connected to an interface, or ENXIO if the requested interface does + not exist. + +* PPPIOCDISCONN disconnects this channel from the PPP interface that + it is connected to. It will return an EINVAL error if the channel + is not connected to an interface. + +* All other ioctl commands are passed to the channel ioctl() function. + +The ioctl calls that are available on an instance that is attached to +an interface unit are: + +* PPPIOCSMRU sets the MRU (maximum receive unit) for the interface. + The argument should point to an int containing the new MRU value. + +* PPPIOCSFLAGS sets flags which control the operation of the + interface. The argument should be a pointer to an int containing + the new flags value. The bits in the flags value that can be set + are: + + ================ ======================================== + SC_COMP_TCP enable transmit TCP header compression + SC_NO_TCP_CCID disable connection-id compression for + TCP header compression + SC_REJ_COMP_TCP disable receive TCP header decompression + SC_CCP_OPEN Compression Control Protocol (CCP) is + open, so inspect CCP packets + SC_CCP_UP CCP is up, may (de)compress packets + SC_LOOP_TRAFFIC send IP traffic to pppd + SC_MULTILINK enable PPP multilink fragmentation on + transmitted packets + SC_MP_SHORTSEQ expect short multilink sequence + numbers on received multilink fragments + SC_MP_XSHORTSEQ transmit short multilink sequence nos. + ================ ======================================== + + The values of these flags are defined in . Note + that the values of the SC_MULTILINK, SC_MP_SHORTSEQ and + SC_MP_XSHORTSEQ bits are ignored if the CONFIG_PPP_MULTILINK option + is not selected. + +* PPPIOCGFLAGS returns the value of the status/control flags for the + interface unit. The argument should point to an int where the ioctl + will store the flags value. As well as the values listed above for + PPPIOCSFLAGS, the following bits may be set in the returned value: + + ================ ========================================= + SC_COMP_RUN CCP compressor is running + SC_DECOMP_RUN CCP decompressor is running + SC_DC_ERROR CCP decompressor detected non-fatal error + SC_DC_FERROR CCP decompressor detected fatal error + ================ ========================================= + +* PPPIOCSCOMPRESS sets the parameters for packet compression or + decompression. The argument should point to a ppp_option_data + structure (defined in ), which contains a + pointer/length pair which should describe a block of memory + containing a CCP option specifying a compression method and its + parameters. The ppp_option_data struct also contains a ``transmit`` + field. If this is 0, the ioctl will affect the receive path, + otherwise the transmit path. + +* PPPIOCGUNIT returns, in the int pointed to by the argument, the unit + number of this interface unit. + +* PPPIOCSDEBUG sets the debug flags for the interface to the value in + the int pointed to by the argument. Only the least significant bit + is used; if this is 1 the generic layer will print some debug + messages during its operation. This is only intended for debugging + the generic PPP layer code; it is generally not helpful for working + out why a PPP connection is failing. + +* PPPIOCGDEBUG returns the debug flags for the interface in the int + pointed to by the argument. + +* PPPIOCGIDLE returns the time, in seconds, since the last data + packets were sent and received. The argument should point to a + ppp_idle structure (defined in ). If the + CONFIG_PPP_FILTER option is enabled, the set of packets which reset + the transmit and receive idle timers is restricted to those which + pass the ``active`` packet filter. + Two versions of this command exist, to deal with user space + expecting times as either 32-bit or 64-bit time_t seconds. + +* PPPIOCSMAXCID sets the maximum connection-ID parameter (and thus the + number of connection slots) for the TCP header compressor and + decompressor. The lower 16 bits of the int pointed to by the + argument specify the maximum connection-ID for the compressor. If + the upper 16 bits of that int are non-zero, they specify the maximum + connection-ID for the decompressor, otherwise the decompressor's + maximum connection-ID is set to 15. + +* PPPIOCSNPMODE sets the network-protocol mode for a given network + protocol. The argument should point to an npioctl struct (defined + in ). The ``protocol`` field gives the PPP protocol + number for the protocol to be affected, and the ``mode`` field + specifies what to do with packets for that protocol: + + ============= ============================================== + NPMODE_PASS normal operation, transmit and receive packets + NPMODE_DROP silently drop packets for this protocol + NPMODE_ERROR drop packets and return an error on transmit + NPMODE_QUEUE queue up packets for transmit, drop received + packets + ============= ============================================== + + At present NPMODE_ERROR and NPMODE_QUEUE have the same effect as + NPMODE_DROP. + +* PPPIOCGNPMODE returns the network-protocol mode for a given + protocol. The argument should point to an npioctl struct with the + ``protocol`` field set to the PPP protocol number for the protocol of + interest. On return the ``mode`` field will be set to the network- + protocol mode for that protocol. + +* PPPIOCSPASS and PPPIOCSACTIVE set the ``pass`` and ``active`` packet + filters. These ioctls are only available if the CONFIG_PPP_FILTER + option is selected. The argument should point to a sock_fprog + structure (defined in ) containing the compiled BPF + instructions for the filter. Packets are dropped if they fail the + ``pass`` filter; otherwise, if they fail the ``active`` filter they are + passed but they do not reset the transmit or receive idle timer. + +* PPPIOCSMRRU enables or disables multilink processing for received + packets and sets the multilink MRRU (maximum reconstructed receive + unit). The argument should point to an int containing the new MRRU + value. If the MRRU value is 0, processing of received multilink + fragments is disabled. This ioctl is only available if the + CONFIG_PPP_MULTILINK option is selected. + +Last modified: 7-feb-2002 diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt deleted file mode 100644 index fd563aff5fc9..000000000000 --- a/Documentation/networking/ppp_generic.txt +++ /dev/null @@ -1,428 +0,0 @@ - PPP Generic Driver and Channel Interface - ---------------------------------------- - - Paul Mackerras - paulus@samba.org - 7 Feb 2002 - -The generic PPP driver in linux-2.4 provides an implementation of the -functionality which is of use in any PPP implementation, including: - -* the network interface unit (ppp0 etc.) -* the interface to the networking code -* PPP multilink: splitting datagrams between multiple links, and - ordering and combining received fragments -* the interface to pppd, via a /dev/ppp character device -* packet compression and decompression -* TCP/IP header compression and decompression -* detecting network traffic for demand dialling and for idle timeouts -* simple packet filtering - -For sending and receiving PPP frames, the generic PPP driver calls on -the services of PPP `channels'. A PPP channel encapsulates a -mechanism for transporting PPP frames from one machine to another. A -PPP channel implementation can be arbitrarily complex internally but -has a very simple interface with the generic PPP code: it merely has -to be able to send PPP frames, receive PPP frames, and optionally -handle ioctl requests. Currently there are PPP channel -implementations for asynchronous serial ports, synchronous serial -ports, and for PPP over ethernet. - -This architecture makes it possible to implement PPP multilink in a -natural and straightforward way, by allowing more than one channel to -be linked to each ppp network interface unit. The generic layer is -responsible for splitting datagrams on transmit and recombining them -on receive. - - -PPP channel API ---------------- - -See include/linux/ppp_channel.h for the declaration of the types and -functions used to communicate between the generic PPP layer and PPP -channels. - -Each channel has to provide two functions to the generic PPP layer, -via the ppp_channel.ops pointer: - -* start_xmit() is called by the generic layer when it has a frame to - send. The channel has the option of rejecting the frame for - flow-control reasons. In this case, start_xmit() should return 0 - and the channel should call the ppp_output_wakeup() function at a - later time when it can accept frames again, and the generic layer - will then attempt to retransmit the rejected frame(s). If the frame - is accepted, the start_xmit() function should return 1. - -* ioctl() provides an interface which can be used by a user-space - program to control aspects of the channel's behaviour. This - procedure will be called when a user-space program does an ioctl - system call on an instance of /dev/ppp which is bound to the - channel. (Usually it would only be pppd which would do this.) - -The generic PPP layer provides seven functions to channels: - -* ppp_register_channel() is called when a channel has been created, to - notify the PPP generic layer of its presence. For example, setting - a serial port to the PPPDISC line discipline causes the ppp_async - channel code to call this function. - -* ppp_unregister_channel() is called when a channel is to be - destroyed. For example, the ppp_async channel code calls this when - a hangup is detected on the serial port. - -* ppp_output_wakeup() is called by a channel when it has previously - rejected a call to its start_xmit function, and can now accept more - packets. - -* ppp_input() is called by a channel when it has received a complete - PPP frame. - -* ppp_input_error() is called by a channel when it has detected that a - frame has been lost or dropped (for example, because of a FCS (frame - check sequence) error). - -* ppp_channel_index() returns the channel index assigned by the PPP - generic layer to this channel. The channel should provide some way - (e.g. an ioctl) to transmit this back to user-space, as user-space - will need it to attach an instance of /dev/ppp to this channel. - -* ppp_unit_number() returns the unit number of the ppp network - interface to which this channel is connected, or -1 if the channel - is not connected. - -Connecting a channel to the ppp generic layer is initiated from the -channel code, rather than from the generic layer. The channel is -expected to have some way for a user-level process to control it -independently of the ppp generic layer. For example, with the -ppp_async channel, this is provided by the file descriptor to the -serial port. - -Generally a user-level process will initialize the underlying -communications medium and prepare it to do PPP. For example, with an -async tty, this can involve setting the tty speed and modes, issuing -modem commands, and then going through some sort of dialog with the -remote system to invoke PPP service there. We refer to this process -as `discovery'. Then the user-level process tells the medium to -become a PPP channel and register itself with the generic PPP layer. -The channel then has to report the channel number assigned to it back -to the user-level process. From that point, the PPP negotiation code -in the PPP daemon (pppd) can take over and perform the PPP -negotiation, accessing the channel through the /dev/ppp interface. - -At the interface to the PPP generic layer, PPP frames are stored in -skbuff structures and start with the two-byte PPP protocol number. -The frame does *not* include the 0xff `address' byte or the 0x03 -`control' byte that are optionally used in async PPP. Nor is there -any escaping of control characters, nor are there any FCS or framing -characters included. That is all the responsibility of the channel -code, if it is needed for the particular medium. That is, the skbuffs -presented to the start_xmit() function contain only the 2-byte -protocol number and the data, and the skbuffs presented to ppp_input() -must be in the same format. - -The channel must provide an instance of a ppp_channel struct to -represent the channel. The channel is free to use the `private' field -however it wishes. The channel should initialize the `mtu' and -`hdrlen' fields before calling ppp_register_channel() and not change -them until after ppp_unregister_channel() returns. The `mtu' field -represents the maximum size of the data part of the PPP frames, that -is, it does not include the 2-byte protocol number. - -If the channel needs some headroom in the skbuffs presented to it for -transmission (i.e., some space free in the skbuff data area before the -start of the PPP frame), it should set the `hdrlen' field of the -ppp_channel struct to the amount of headroom required. The generic -PPP layer will attempt to provide that much headroom but the channel -should still check if there is sufficient headroom and copy the skbuff -if there isn't. - -On the input side, channels should ideally provide at least 2 bytes of -headroom in the skbuffs presented to ppp_input(). The generic PPP -code does not require this but will be more efficient if this is done. - - -Buffering and flow control --------------------------- - -The generic PPP layer has been designed to minimize the amount of data -that it buffers in the transmit direction. It maintains a queue of -transmit packets for the PPP unit (network interface device) plus a -queue of transmit packets for each attached channel. Normally the -transmit queue for the unit will contain at most one packet; the -exceptions are when pppd sends packets by writing to /dev/ppp, and -when the core networking code calls the generic layer's start_xmit() -function with the queue stopped, i.e. when the generic layer has -called netif_stop_queue(), which only happens on a transmit timeout. -The start_xmit function always accepts and queues the packet which it -is asked to transmit. - -Transmit packets are dequeued from the PPP unit transmit queue and -then subjected to TCP/IP header compression and packet compression -(Deflate or BSD-Compress compression), as appropriate. After this -point the packets can no longer be reordered, as the decompression -algorithms rely on receiving compressed packets in the same order that -they were generated. - -If multilink is not in use, this packet is then passed to the attached -channel's start_xmit() function. If the channel refuses to take -the packet, the generic layer saves it for later transmission. The -generic layer will call the channel's start_xmit() function again -when the channel calls ppp_output_wakeup() or when the core -networking code calls the generic layer's start_xmit() function -again. The generic layer contains no timeout and retransmission -logic; it relies on the core networking code for that. - -If multilink is in use, the generic layer divides the packet into one -or more fragments and puts a multilink header on each fragment. It -decides how many fragments to use based on the length of the packet -and the number of channels which are potentially able to accept a -fragment at the moment. A channel is potentially able to accept a -fragment if it doesn't have any fragments currently queued up for it -to transmit. The channel may still refuse a fragment; in this case -the fragment is queued up for the channel to transmit later. This -scheme has the effect that more fragments are given to higher- -bandwidth channels. It also means that under light load, the generic -layer will tend to fragment large packets across all the channels, -thus reducing latency, while under heavy load, packets will tend to be -transmitted as single fragments, thus reducing the overhead of -fragmentation. - - -SMP safety ----------- - -The PPP generic layer has been designed to be SMP-safe. Locks are -used around accesses to the internal data structures where necessary -to ensure their integrity. As part of this, the generic layer -requires that the channels adhere to certain requirements and in turn -provides certain guarantees to the channels. Essentially the channels -are required to provide the appropriate locking on the ppp_channel -structures that form the basis of the communication between the -channel and the generic layer. This is because the channel provides -the storage for the ppp_channel structure, and so the channel is -required to provide the guarantee that this storage exists and is -valid at the appropriate times. - -The generic layer requires these guarantees from the channel: - -* The ppp_channel object must exist from the time that - ppp_register_channel() is called until after the call to - ppp_unregister_channel() returns. - -* No thread may be in a call to any of ppp_input(), ppp_input_error(), - ppp_output_wakeup(), ppp_channel_index() or ppp_unit_number() for a - channel at the time that ppp_unregister_channel() is called for that - channel. - -* ppp_register_channel() and ppp_unregister_channel() must be called - from process context, not interrupt or softirq/BH context. - -* The remaining generic layer functions may be called at softirq/BH - level but must not be called from a hardware interrupt handler. - -* The generic layer may call the channel start_xmit() function at - softirq/BH level but will not call it at interrupt level. Thus the - start_xmit() function may not block. - -* The generic layer will only call the channel ioctl() function in - process context. - -The generic layer provides these guarantees to the channels: - -* The generic layer will not call the start_xmit() function for a - channel while any thread is already executing in that function for - that channel. - -* The generic layer will not call the ioctl() function for a channel - while any thread is already executing in that function for that - channel. - -* By the time a call to ppp_unregister_channel() returns, no thread - will be executing in a call from the generic layer to that channel's - start_xmit() or ioctl() function, and the generic layer will not - call either of those functions subsequently. - - -Interface to pppd ------------------ - -The PPP generic layer exports a character device interface called -/dev/ppp. This is used by pppd to control PPP interface units and -channels. Although there is only one /dev/ppp, each open instance of -/dev/ppp acts independently and can be attached either to a PPP unit -or a PPP channel. This is achieved using the file->private_data field -to point to a separate object for each open instance of /dev/ppp. In -this way an effect similar to Solaris' clone open is obtained, -allowing us to control an arbitrary number of PPP interfaces and -channels without having to fill up /dev with hundreds of device names. - -When /dev/ppp is opened, a new instance is created which is initially -unattached. Using an ioctl call, it can then be attached to an -existing unit, attached to a newly-created unit, or attached to an -existing channel. An instance attached to a unit can be used to send -and receive PPP control frames, using the read() and write() system -calls, along with poll() if necessary. Similarly, an instance -attached to a channel can be used to send and receive PPP frames on -that channel. - -In multilink terms, the unit represents the bundle, while the channels -represent the individual physical links. Thus, a PPP frame sent by a -write to the unit (i.e., to an instance of /dev/ppp attached to the -unit) will be subject to bundle-level compression and to fragmentation -across the individual links (if multilink is in use). In contrast, a -PPP frame sent by a write to the channel will be sent as-is on that -channel, without any multilink header. - -A channel is not initially attached to any unit. In this state it can -be used for PPP negotiation but not for the transfer of data packets. -It can then be connected to a PPP unit with an ioctl call, which -makes it available to send and receive data packets for that unit. - -The ioctl calls which are available on an instance of /dev/ppp depend -on whether it is unattached, attached to a PPP interface, or attached -to a PPP channel. The ioctl calls which are available on an -unattached instance are: - -* PPPIOCNEWUNIT creates a new PPP interface and makes this /dev/ppp - instance the "owner" of the interface. The argument should point to - an int which is the desired unit number if >= 0, or -1 to assign the - lowest unused unit number. Being the owner of the interface means - that the interface will be shut down if this instance of /dev/ppp is - closed. - -* PPPIOCATTACH attaches this instance to an existing PPP interface. - The argument should point to an int containing the unit number. - This does not make this instance the owner of the PPP interface. - -* PPPIOCATTCHAN attaches this instance to an existing PPP channel. - The argument should point to an int containing the channel number. - -The ioctl calls available on an instance of /dev/ppp attached to a -channel are: - -* PPPIOCCONNECT connects this channel to a PPP interface. The - argument should point to an int containing the interface unit - number. It will return an EINVAL error if the channel is already - connected to an interface, or ENXIO if the requested interface does - not exist. - -* PPPIOCDISCONN disconnects this channel from the PPP interface that - it is connected to. It will return an EINVAL error if the channel - is not connected to an interface. - -* All other ioctl commands are passed to the channel ioctl() function. - -The ioctl calls that are available on an instance that is attached to -an interface unit are: - -* PPPIOCSMRU sets the MRU (maximum receive unit) for the interface. - The argument should point to an int containing the new MRU value. - -* PPPIOCSFLAGS sets flags which control the operation of the - interface. The argument should be a pointer to an int containing - the new flags value. The bits in the flags value that can be set - are: - SC_COMP_TCP enable transmit TCP header compression - SC_NO_TCP_CCID disable connection-id compression for - TCP header compression - SC_REJ_COMP_TCP disable receive TCP header decompression - SC_CCP_OPEN Compression Control Protocol (CCP) is - open, so inspect CCP packets - SC_CCP_UP CCP is up, may (de)compress packets - SC_LOOP_TRAFFIC send IP traffic to pppd - SC_MULTILINK enable PPP multilink fragmentation on - transmitted packets - SC_MP_SHORTSEQ expect short multilink sequence - numbers on received multilink fragments - SC_MP_XSHORTSEQ transmit short multilink sequence nos. - - The values of these flags are defined in . Note - that the values of the SC_MULTILINK, SC_MP_SHORTSEQ and - SC_MP_XSHORTSEQ bits are ignored if the CONFIG_PPP_MULTILINK option - is not selected. - -* PPPIOCGFLAGS returns the value of the status/control flags for the - interface unit. The argument should point to an int where the ioctl - will store the flags value. As well as the values listed above for - PPPIOCSFLAGS, the following bits may be set in the returned value: - SC_COMP_RUN CCP compressor is running - SC_DECOMP_RUN CCP decompressor is running - SC_DC_ERROR CCP decompressor detected non-fatal error - SC_DC_FERROR CCP decompressor detected fatal error - -* PPPIOCSCOMPRESS sets the parameters for packet compression or - decompression. The argument should point to a ppp_option_data - structure (defined in ), which contains a - pointer/length pair which should describe a block of memory - containing a CCP option specifying a compression method and its - parameters. The ppp_option_data struct also contains a `transmit' - field. If this is 0, the ioctl will affect the receive path, - otherwise the transmit path. - -* PPPIOCGUNIT returns, in the int pointed to by the argument, the unit - number of this interface unit. - -* PPPIOCSDEBUG sets the debug flags for the interface to the value in - the int pointed to by the argument. Only the least significant bit - is used; if this is 1 the generic layer will print some debug - messages during its operation. This is only intended for debugging - the generic PPP layer code; it is generally not helpful for working - out why a PPP connection is failing. - -* PPPIOCGDEBUG returns the debug flags for the interface in the int - pointed to by the argument. - -* PPPIOCGIDLE returns the time, in seconds, since the last data - packets were sent and received. The argument should point to a - ppp_idle structure (defined in ). If the - CONFIG_PPP_FILTER option is enabled, the set of packets which reset - the transmit and receive idle timers is restricted to those which - pass the `active' packet filter. - Two versions of this command exist, to deal with user space - expecting times as either 32-bit or 64-bit time_t seconds. - -* PPPIOCSMAXCID sets the maximum connection-ID parameter (and thus the - number of connection slots) for the TCP header compressor and - decompressor. The lower 16 bits of the int pointed to by the - argument specify the maximum connection-ID for the compressor. If - the upper 16 bits of that int are non-zero, they specify the maximum - connection-ID for the decompressor, otherwise the decompressor's - maximum connection-ID is set to 15. - -* PPPIOCSNPMODE sets the network-protocol mode for a given network - protocol. The argument should point to an npioctl struct (defined - in ). The `protocol' field gives the PPP protocol - number for the protocol to be affected, and the `mode' field - specifies what to do with packets for that protocol: - - NPMODE_PASS normal operation, transmit and receive packets - NPMODE_DROP silently drop packets for this protocol - NPMODE_ERROR drop packets and return an error on transmit - NPMODE_QUEUE queue up packets for transmit, drop received - packets - - At present NPMODE_ERROR and NPMODE_QUEUE have the same effect as - NPMODE_DROP. - -* PPPIOCGNPMODE returns the network-protocol mode for a given - protocol. The argument should point to an npioctl struct with the - `protocol' field set to the PPP protocol number for the protocol of - interest. On return the `mode' field will be set to the network- - protocol mode for that protocol. - -* PPPIOCSPASS and PPPIOCSACTIVE set the `pass' and `active' packet - filters. These ioctls are only available if the CONFIG_PPP_FILTER - option is selected. The argument should point to a sock_fprog - structure (defined in ) containing the compiled BPF - instructions for the filter. Packets are dropped if they fail the - `pass' filter; otherwise, if they fail the `active' filter they are - passed but they do not reset the transmit or receive idle timer. - -* PPPIOCSMRRU enables or disables multilink processing for received - packets and sets the multilink MRRU (maximum reconstructed receive - unit). The argument should point to an int containing the new MRRU - value. If the MRRU value is 0, processing of received multilink - fragments is disabled. This ioctl is only available if the - CONFIG_PPP_MULTILINK option is selected. - -Last modified: 7-feb-2002 -- cgit v1.2.3 From 832619012c972dc1ca0723ad66ffff6e6e4cf5e0 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:16 +0200 Subject: docs: networking: convert proc_net_tcp.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/proc_net_tcp.rst | 57 +++++++++++++++++++++++++++++++ Documentation/networking/proc_net_tcp.txt | 48 -------------------------- 3 files changed, 58 insertions(+), 48 deletions(-) create mode 100644 Documentation/networking/proc_net_tcp.rst delete mode 100644 Documentation/networking/proc_net_tcp.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index f89535871481..0da7eb0ec85a 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -94,6 +94,7 @@ Contents: pktgen plip ppp_generic + proc_net_tcp .. only:: subproject and html diff --git a/Documentation/networking/proc_net_tcp.rst b/Documentation/networking/proc_net_tcp.rst new file mode 100644 index 000000000000..7d9dfe36af45 --- /dev/null +++ b/Documentation/networking/proc_net_tcp.rst @@ -0,0 +1,57 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================ +The proc/net/tcp and proc/net/tcp6 variables +============================================ + +This document describes the interfaces /proc/net/tcp and /proc/net/tcp6. +Note that these interfaces are deprecated in favor of tcp_diag. + +These /proc interfaces provide information about currently active TCP +connections, and are implemented by tcp4_seq_show() in net/ipv4/tcp_ipv4.c +and tcp6_seq_show() in net/ipv6/tcp_ipv6.c, respectively. + +It will first list all listening TCP sockets, and next list all established +TCP connections. A typical entry of /proc/net/tcp would look like this (split +up into 3 parts because of the length of the line):: + + 46: 010310AC:9C4C 030310AC:1770 01 + | | | | | |--> connection state + | | | | |------> remote TCP port number + | | | |-------------> remote IPv4 address + | | |--------------------> local TCP port number + | |---------------------------> local IPv4 address + |----------------------------------> number of entry + + 00000150:00000000 01:00000019 00000000 + | | | | |--> number of unrecovered RTO timeouts + | | | |----------> number of jiffies until timer expires + | | |----------------> timer_active (see below) + | |----------------------> receive-queue + |-------------------------------> transmit-queue + + 1000 0 54165785 4 cd1e6040 25 4 27 3 -1 + | | | | | | | | | |--> slow start size threshold, + | | | | | | | | | or -1 if the threshold + | | | | | | | | | is >= 0xFFFF + | | | | | | | | |----> sending congestion window + | | | | | | | |-------> (ack.quick<<1)|ack.pingpong + | | | | | | |---------> Predicted tick of soft clock + | | | | | | (delayed ACK control data) + | | | | | |------------> retransmit timeout + | | | | |------------------> location of socket in memory + | | | |-----------------------> socket reference count + | | |-----------------------------> inode + | |----------------------------------> unanswered 0-window probes + |---------------------------------------------> uid + +timer_active: + + == ================================================================ + 0 no timer is pending + 1 retransmit-timer is pending + 2 another timer (e.g. delayed ack or keepalive) is pending + 3 this is a socket in TIME_WAIT state. Not all fields will contain + data (or even exist) + 4 zero window probe timer is pending + == ================================================================ diff --git a/Documentation/networking/proc_net_tcp.txt b/Documentation/networking/proc_net_tcp.txt deleted file mode 100644 index 4a79209e77a7..000000000000 --- a/Documentation/networking/proc_net_tcp.txt +++ /dev/null @@ -1,48 +0,0 @@ -This document describes the interfaces /proc/net/tcp and /proc/net/tcp6. -Note that these interfaces are deprecated in favor of tcp_diag. - -These /proc interfaces provide information about currently active TCP -connections, and are implemented by tcp4_seq_show() in net/ipv4/tcp_ipv4.c -and tcp6_seq_show() in net/ipv6/tcp_ipv6.c, respectively. - -It will first list all listening TCP sockets, and next list all established -TCP connections. A typical entry of /proc/net/tcp would look like this (split -up into 3 parts because of the length of the line): - - 46: 010310AC:9C4C 030310AC:1770 01 - | | | | | |--> connection state - | | | | |------> remote TCP port number - | | | |-------------> remote IPv4 address - | | |--------------------> local TCP port number - | |---------------------------> local IPv4 address - |----------------------------------> number of entry - - 00000150:00000000 01:00000019 00000000 - | | | | |--> number of unrecovered RTO timeouts - | | | |----------> number of jiffies until timer expires - | | |----------------> timer_active (see below) - | |----------------------> receive-queue - |-------------------------------> transmit-queue - - 1000 0 54165785 4 cd1e6040 25 4 27 3 -1 - | | | | | | | | | |--> slow start size threshold, - | | | | | | | | | or -1 if the threshold - | | | | | | | | | is >= 0xFFFF - | | | | | | | | |----> sending congestion window - | | | | | | | |-------> (ack.quick<<1)|ack.pingpong - | | | | | | |---------> Predicted tick of soft clock - | | | | | | (delayed ACK control data) - | | | | | |------------> retransmit timeout - | | | | |------------------> location of socket in memory - | | | |-----------------------> socket reference count - | | |-----------------------------> inode - | |----------------------------------> unanswered 0-window probes - |---------------------------------------------> uid - -timer_active: - 0 no timer is pending - 1 retransmit-timer is pending - 2 another timer (e.g. delayed ack or keepalive) is pending - 3 this is a socket in TIME_WAIT state. Not all fields will contain - data (or even exist) - 4 zero window probe timer is pending -- cgit v1.2.3 From 66d495d0a5aecd1692f6b5e3190de14f9a31e14b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:17 +0200 Subject: docs: networking: convert radiotap-headers.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/mac80211-injection.rst | 2 +- Documentation/networking/radiotap-headers.rst | 159 ++++++++++++++++++++++++ Documentation/networking/radiotap-headers.txt | 152 ---------------------- include/net/cfg80211.h | 2 +- net/wireless/radiotap.c | 2 +- 6 files changed, 163 insertions(+), 155 deletions(-) create mode 100644 Documentation/networking/radiotap-headers.rst delete mode 100644 Documentation/networking/radiotap-headers.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 0da7eb0ec85a..85bc52d0b3a6 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -95,6 +95,7 @@ Contents: plip ppp_generic proc_net_tcp + radiotap-headers .. only:: subproject and html diff --git a/Documentation/networking/mac80211-injection.rst b/Documentation/networking/mac80211-injection.rst index 75d4edcae852..be65f886ff1f 100644 --- a/Documentation/networking/mac80211-injection.rst +++ b/Documentation/networking/mac80211-injection.rst @@ -13,7 +13,7 @@ following format:: [ payload ] The radiotap format is discussed in -./Documentation/networking/radiotap-headers.txt. +./Documentation/networking/radiotap-headers.rst. Despite many radiotap parameters being currently defined, most only make sense to appear on received packets. The following information is parsed from the diff --git a/Documentation/networking/radiotap-headers.rst b/Documentation/networking/radiotap-headers.rst new file mode 100644 index 000000000000..1a1bd1ec0650 --- /dev/null +++ b/Documentation/networking/radiotap-headers.rst @@ -0,0 +1,159 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +How to use radiotap headers +=========================== + +Pointer to the radiotap include file +------------------------------------ + +Radiotap headers are variable-length and extensible, you can get most of the +information you need to know on them from:: + + ./include/net/ieee80211_radiotap.h + +This document gives an overview and warns on some corner cases. + + +Structure of the header +----------------------- + +There is a fixed portion at the start which contains a u32 bitmap that defines +if the possible argument associated with that bit is present or not. So if b0 +of the it_present member of ieee80211_radiotap_header is set, it means that +the header for argument index 0 (IEEE80211_RADIOTAP_TSFT) is present in the +argument area. + +:: + + < 8-byte ieee80211_radiotap_header > + [ ] + [ ... ] + +At the moment there are only 13 possible argument indexes defined, but in case +we run out of space in the u32 it_present member, it is defined that b31 set +indicates that there is another u32 bitmap following (shown as "possible +argument bitmap extensions..." above), and the start of the arguments is moved +forward 4 bytes each time. + +Note also that the it_len member __le16 is set to the total number of bytes +covered by the ieee80211_radiotap_header and any arguments following. + + +Requirements for arguments +-------------------------- + +After the fixed part of the header, the arguments follow for each argument +index whose matching bit is set in the it_present member of +ieee80211_radiotap_header. + + - the arguments are all stored little-endian! + + - the argument payload for a given argument index has a fixed size. So + IEEE80211_RADIOTAP_TSFT being present always indicates an 8-byte argument is + present. See the comments in ./include/net/ieee80211_radiotap.h for a nice + breakdown of all the argument sizes + + - the arguments must be aligned to a boundary of the argument size using + padding. So a u16 argument must start on the next u16 boundary if it isn't + already on one, a u32 must start on the next u32 boundary and so on. + + - "alignment" is relative to the start of the ieee80211_radiotap_header, ie, + the first byte of the radiotap header. The absolute alignment of that first + byte isn't defined. So even if the whole radiotap header is starting at, eg, + address 0x00000003, still the first byte of the radiotap header is treated as + 0 for alignment purposes. + + - the above point that there may be no absolute alignment for multibyte + entities in the fixed radiotap header or the argument region means that you + have to take special evasive action when trying to access these multibyte + entities. Some arches like Blackfin cannot deal with an attempt to + dereference, eg, a u16 pointer that is pointing to an odd address. Instead + you have to use a kernel API get_unaligned() to dereference the pointer, + which will do it bytewise on the arches that require that. + + - The arguments for a given argument index can be a compound of multiple types + together. For example IEEE80211_RADIOTAP_CHANNEL has an argument payload + consisting of two u16s of total length 4. When this happens, the padding + rule is applied dealing with a u16, NOT dealing with a 4-byte single entity. + + +Example valid radiotap header +----------------------------- + +:: + + 0x00, 0x00, // <-- radiotap version + pad byte + 0x0b, 0x00, // <- radiotap header length + 0x04, 0x0c, 0x00, 0x00, // <-- bitmap + 0x6c, // <-- rate (in 500kHz units) + 0x0c, //<-- tx power + 0x01 //<-- antenna + + +Using the Radiotap Parser +------------------------- + +If you are having to parse a radiotap struct, you can radically simplify the +job by using the radiotap parser that lives in net/wireless/radiotap.c and has +its prototypes available in include/net/cfg80211.h. You use it like this:: + + #include + + /* buf points to the start of the radiotap header part */ + + int MyFunction(u8 * buf, int buflen) + { + int pkt_rate_100kHz = 0, antenna = 0, pwr = 0; + struct ieee80211_radiotap_iterator iterator; + int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen); + + while (!ret) { + + ret = ieee80211_radiotap_iterator_next(&iterator); + + if (ret) + continue; + + /* see if this argument is something we can use */ + + switch (iterator.this_arg_index) { + /* + * You must take care when dereferencing iterator.this_arg + * for multibyte types... the pointer is not aligned. Use + * get_unaligned((type *)iterator.this_arg) to dereference + * iterator.this_arg for type "type" safely on all arches. + */ + case IEEE80211_RADIOTAP_RATE: + /* radiotap "rate" u8 is in + * 500kbps units, eg, 0x02=1Mbps + */ + pkt_rate_100kHz = (*iterator.this_arg) * 5; + break; + + case IEEE80211_RADIOTAP_ANTENNA: + /* radiotap uses 0 for 1st ant */ + antenna = *iterator.this_arg); + break; + + case IEEE80211_RADIOTAP_DBM_TX_POWER: + pwr = *iterator.this_arg; + break; + + default: + break; + } + } /* while more rt headers */ + + if (ret != -ENOENT) + return TXRX_DROP; + + /* discard the radiotap header part */ + buf += iterator.max_length; + buflen -= iterator.max_length; + + ... + + } + +Andy Green diff --git a/Documentation/networking/radiotap-headers.txt b/Documentation/networking/radiotap-headers.txt deleted file mode 100644 index 953331c7984f..000000000000 --- a/Documentation/networking/radiotap-headers.txt +++ /dev/null @@ -1,152 +0,0 @@ -How to use radiotap headers -=========================== - -Pointer to the radiotap include file ------------------------------------- - -Radiotap headers are variable-length and extensible, you can get most of the -information you need to know on them from: - -./include/net/ieee80211_radiotap.h - -This document gives an overview and warns on some corner cases. - - -Structure of the header ------------------------ - -There is a fixed portion at the start which contains a u32 bitmap that defines -if the possible argument associated with that bit is present or not. So if b0 -of the it_present member of ieee80211_radiotap_header is set, it means that -the header for argument index 0 (IEEE80211_RADIOTAP_TSFT) is present in the -argument area. - - < 8-byte ieee80211_radiotap_header > - [ ] - [ ... ] - -At the moment there are only 13 possible argument indexes defined, but in case -we run out of space in the u32 it_present member, it is defined that b31 set -indicates that there is another u32 bitmap following (shown as "possible -argument bitmap extensions..." above), and the start of the arguments is moved -forward 4 bytes each time. - -Note also that the it_len member __le16 is set to the total number of bytes -covered by the ieee80211_radiotap_header and any arguments following. - - -Requirements for arguments --------------------------- - -After the fixed part of the header, the arguments follow for each argument -index whose matching bit is set in the it_present member of -ieee80211_radiotap_header. - - - the arguments are all stored little-endian! - - - the argument payload for a given argument index has a fixed size. So - IEEE80211_RADIOTAP_TSFT being present always indicates an 8-byte argument is - present. See the comments in ./include/net/ieee80211_radiotap.h for a nice - breakdown of all the argument sizes - - - the arguments must be aligned to a boundary of the argument size using - padding. So a u16 argument must start on the next u16 boundary if it isn't - already on one, a u32 must start on the next u32 boundary and so on. - - - "alignment" is relative to the start of the ieee80211_radiotap_header, ie, - the first byte of the radiotap header. The absolute alignment of that first - byte isn't defined. So even if the whole radiotap header is starting at, eg, - address 0x00000003, still the first byte of the radiotap header is treated as - 0 for alignment purposes. - - - the above point that there may be no absolute alignment for multibyte - entities in the fixed radiotap header or the argument region means that you - have to take special evasive action when trying to access these multibyte - entities. Some arches like Blackfin cannot deal with an attempt to - dereference, eg, a u16 pointer that is pointing to an odd address. Instead - you have to use a kernel API get_unaligned() to dereference the pointer, - which will do it bytewise on the arches that require that. - - - The arguments for a given argument index can be a compound of multiple types - together. For example IEEE80211_RADIOTAP_CHANNEL has an argument payload - consisting of two u16s of total length 4. When this happens, the padding - rule is applied dealing with a u16, NOT dealing with a 4-byte single entity. - - -Example valid radiotap header ------------------------------ - - 0x00, 0x00, // <-- radiotap version + pad byte - 0x0b, 0x00, // <- radiotap header length - 0x04, 0x0c, 0x00, 0x00, // <-- bitmap - 0x6c, // <-- rate (in 500kHz units) - 0x0c, //<-- tx power - 0x01 //<-- antenna - - -Using the Radiotap Parser -------------------------- - -If you are having to parse a radiotap struct, you can radically simplify the -job by using the radiotap parser that lives in net/wireless/radiotap.c and has -its prototypes available in include/net/cfg80211.h. You use it like this: - -#include - -/* buf points to the start of the radiotap header part */ - -int MyFunction(u8 * buf, int buflen) -{ - int pkt_rate_100kHz = 0, antenna = 0, pwr = 0; - struct ieee80211_radiotap_iterator iterator; - int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen); - - while (!ret) { - - ret = ieee80211_radiotap_iterator_next(&iterator); - - if (ret) - continue; - - /* see if this argument is something we can use */ - - switch (iterator.this_arg_index) { - /* - * You must take care when dereferencing iterator.this_arg - * for multibyte types... the pointer is not aligned. Use - * get_unaligned((type *)iterator.this_arg) to dereference - * iterator.this_arg for type "type" safely on all arches. - */ - case IEEE80211_RADIOTAP_RATE: - /* radiotap "rate" u8 is in - * 500kbps units, eg, 0x02=1Mbps - */ - pkt_rate_100kHz = (*iterator.this_arg) * 5; - break; - - case IEEE80211_RADIOTAP_ANTENNA: - /* radiotap uses 0 for 1st ant */ - antenna = *iterator.this_arg); - break; - - case IEEE80211_RADIOTAP_DBM_TX_POWER: - pwr = *iterator.this_arg; - break; - - default: - break; - } - } /* while more rt headers */ - - if (ret != -ENOENT) - return TXRX_DROP; - - /* discard the radiotap header part */ - buf += iterator.max_length; - buflen -= iterator.max_length; - - ... - -} - -Andy Green diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h index 70e48f66dac8..46ac80423b28 100644 --- a/include/net/cfg80211.h +++ b/include/net/cfg80211.h @@ -5211,7 +5211,7 @@ u32 ieee80211_mandatory_rates(struct ieee80211_supported_band *sband, * Radiotap parsing functions -- for controlled injection support * * Implemented in net/wireless/radiotap.c - * Documentation in Documentation/networking/radiotap-headers.txt + * Documentation in Documentation/networking/radiotap-headers.rst */ struct radiotap_align_size { diff --git a/net/wireless/radiotap.c b/net/wireless/radiotap.c index 6582d155e2fc..d5e28239e030 100644 --- a/net/wireless/radiotap.c +++ b/net/wireless/radiotap.c @@ -90,7 +90,7 @@ static const struct ieee80211_radiotap_namespace radiotap_ns = { * iterator.this_arg for type "type" safely on all arches. * * Example code: - * See Documentation/networking/radiotap-headers.txt + * See Documentation/networking/radiotap-headers.rst */ int ieee80211_radiotap_iterator_init( -- cgit v1.2.3 From 8c6e172002987202c17756e06e84c7e4d8916c44 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:18 +0200 Subject: docs: networking: convert ray_cs.txt to ReST - add SPDX header; - use copyright symbol; - add a document title; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ray_cs.rst | 165 ++++++++++++++++++++++++++++++++++++ Documentation/networking/ray_cs.txt | 150 -------------------------------- drivers/net/wireless/Kconfig | 2 +- 4 files changed, 167 insertions(+), 151 deletions(-) create mode 100644 Documentation/networking/ray_cs.rst delete mode 100644 Documentation/networking/ray_cs.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 85bc52d0b3a6..b7e35b0d905c 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -96,6 +96,7 @@ Contents: ppp_generic proc_net_tcp radiotap-headers + ray_cs .. only:: subproject and html diff --git a/Documentation/networking/ray_cs.rst b/Documentation/networking/ray_cs.rst new file mode 100644 index 000000000000..9a46d1ae8f20 --- /dev/null +++ b/Documentation/networking/ray_cs.rst @@ -0,0 +1,165 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: + +========================= +Raylink wireless LAN card +========================= + +September 21, 1999 + +Copyright |copy| 1998 Corey Thomas (corey@world.std.com) + +This file is the documentation for the Raylink Wireless LAN card driver for +Linux. The Raylink wireless LAN card is a PCMCIA card which provides IEEE +802.11 compatible wireless network connectivity at 1 and 2 megabits/second. +See http://www.raytheon.com/micro/raylink/ for more information on the Raylink +card. This driver is in early development and does have bugs. See the known +bugs and limitations at the end of this document for more information. +This driver also works with WebGear's Aviator 2.4 and Aviator Pro +wireless LAN cards. + +As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel +source. My web page for the development of ray_cs is at +http://web.ralinktech.com/ralink/Home/Support/Linux.html +and I can be emailed at corey@world.std.com + +The kernel driver is based on ray_cs-1.62.tgz + +The driver at my web page is intended to be used as an add on to +David Hinds pcmcia package. All the command line parameters are +available when compiled as a module. When built into the kernel, only +the essid= string parameter is available via the kernel command line. +This will change after the method of sorting out parameters for all +the PCMCIA drivers is agreed upon. If you must have a built in driver +with nondefault parameters, they can be edited in +/usr/src/linux/drivers/net/pcmcia/ray_cs.c. Searching for module_param +will find them all. + +Information on card services is available at: + + http://pcmcia-cs.sourceforge.net/ + + +Card services user programs are still required for PCMCIA devices. +pcmcia-cs-3.1.1 or greater is required for the kernel version of +the driver. + +Currently, ray_cs is not part of David Hinds card services package, +so the following magic is required. + +At the end of the /etc/pcmcia/config.opts file, add the line: +source ./ray_cs.opts +This will make card services read the ray_cs.opts file +when starting. Create the file /etc/pcmcia/ray_cs.opts containing the +following:: + + #### start of /etc/pcmcia/ray_cs.opts ################### + # Configuration options for Raylink Wireless LAN PCMCIA card + device "ray_cs" + class "network" module "misc/ray_cs" + + card "RayLink PC Card WLAN Adapter" + manfid 0x01a6, 0x0000 + bind "ray_cs" + + module "misc/ray_cs" opts "" + #### end of /etc/pcmcia/ray_cs.opts ##################### + + +To join an existing network with +different parameters, contact the network administrator for the +configuration information, and edit /etc/pcmcia/ray_cs.opts. +Add the parameters below between the empty quotes. + +Parameters for ray_cs driver which may be specified in ray_cs.opts: + +=============== =============== ============================================= +bc integer 0 = normal mode (802.11 timing), + 1 = slow down inter frame timing to allow + operation with older breezecom access + points. + +beacon_period integer beacon period in Kilo-microseconds, + + legal values = must be integer multiple + of hop dwell + + default = 256 + +country integer 1 = USA (default), + 2 = Europe, + 3 = Japan, + 4 = Korea, + 5 = Spain, + 6 = France, + 7 = Israel, + 8 = Australia + +essid string ESS ID - network name to join + + string with maximum length of 32 chars + default value = "ADHOC_ESSID" + +hop_dwell integer hop dwell time in Kilo-microseconds + + legal values = 16,32,64,128(default),256 + +irq_mask integer linux standard 16 bit value 1bit/IRQ + + lsb is IRQ 0, bit 1 is IRQ 1 etc. + Used to restrict choice of IRQ's to use. + Recommended method for controlling + interrupts is in /etc/pcmcia/config.opts + +net_type integer 0 (default) = adhoc network, + 1 = infrastructure + +phy_addr string string containing new MAC address in + hex, must start with x eg + x00008f123456 + +psm integer 0 = continuously active, + 1 = power save mode (not useful yet) + +pc_debug integer (0-5) larger values for more verbose + logging. Replaces ray_debug. + +ray_debug integer Replaced with pc_debug + +ray_mem_speed integer defaults to 500 + +sniffer integer 0 = not sniffer (default), + 1 = sniffer which can be used to record all + network traffic using tcpdump or similar, + but no normal network use is allowed. + +translate integer 0 = no translation (encapsulate frames), + 1 = translation (RFC1042/802.1) +=============== =============== ============================================= + +More on sniffer mode: + +tcpdump does not understand 802.11 headers, so it can't +interpret the contents, but it can record to a file. This is only +useful for debugging 802.11 lowlevel protocols that are not visible to +linux. If you want to watch ftp xfers, or do similar things, you +don't need to use sniffer mode. Also, some packet types are never +sent up by the card, so you will never see them (ack, rts, cts, probe +etc.) There is a simple program (showcap) included in the ray_cs +package which parses the 802.11 headers. + +Known Problems and missing features + + Does not work with non x86 + + Does not work with SMP + + Support for defragmenting frames is not yet debugged, and in + fact is known to not work. I have never encountered a net set + up to fragment, but still, it should be fixed. + + The ioctl support is incomplete. The hardware address cannot be set + using ifconfig yet. If a different hardware address is needed, it may + be set using the phy_addr parameter in ray_cs.opts. This requires + a card insertion to take effect. diff --git a/Documentation/networking/ray_cs.txt b/Documentation/networking/ray_cs.txt deleted file mode 100644 index c0c12307ed9d..000000000000 --- a/Documentation/networking/ray_cs.txt +++ /dev/null @@ -1,150 +0,0 @@ -September 21, 1999 - -Copyright (c) 1998 Corey Thomas (corey@world.std.com) - -This file is the documentation for the Raylink Wireless LAN card driver for -Linux. The Raylink wireless LAN card is a PCMCIA card which provides IEEE -802.11 compatible wireless network connectivity at 1 and 2 megabits/second. -See http://www.raytheon.com/micro/raylink/ for more information on the Raylink -card. This driver is in early development and does have bugs. See the known -bugs and limitations at the end of this document for more information. -This driver also works with WebGear's Aviator 2.4 and Aviator Pro -wireless LAN cards. - -As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel -source. My web page for the development of ray_cs is at -http://web.ralinktech.com/ralink/Home/Support/Linux.html -and I can be emailed at corey@world.std.com - -The kernel driver is based on ray_cs-1.62.tgz - -The driver at my web page is intended to be used as an add on to -David Hinds pcmcia package. All the command line parameters are -available when compiled as a module. When built into the kernel, only -the essid= string parameter is available via the kernel command line. -This will change after the method of sorting out parameters for all -the PCMCIA drivers is agreed upon. If you must have a built in driver -with nondefault parameters, they can be edited in -/usr/src/linux/drivers/net/pcmcia/ray_cs.c. Searching for module_param -will find them all. - -Information on card services is available at: - http://pcmcia-cs.sourceforge.net/ - - -Card services user programs are still required for PCMCIA devices. -pcmcia-cs-3.1.1 or greater is required for the kernel version of -the driver. - -Currently, ray_cs is not part of David Hinds card services package, -so the following magic is required. - -At the end of the /etc/pcmcia/config.opts file, add the line: -source ./ray_cs.opts -This will make card services read the ray_cs.opts file -when starting. Create the file /etc/pcmcia/ray_cs.opts containing the -following: - -#### start of /etc/pcmcia/ray_cs.opts ################### -# Configuration options for Raylink Wireless LAN PCMCIA card -device "ray_cs" - class "network" module "misc/ray_cs" - -card "RayLink PC Card WLAN Adapter" - manfid 0x01a6, 0x0000 - bind "ray_cs" - -module "misc/ray_cs" opts "" -#### end of /etc/pcmcia/ray_cs.opts ##################### - - -To join an existing network with -different parameters, contact the network administrator for the -configuration information, and edit /etc/pcmcia/ray_cs.opts. -Add the parameters below between the empty quotes. - -Parameters for ray_cs driver which may be specified in ray_cs.opts: - -bc integer 0 = normal mode (802.11 timing) - 1 = slow down inter frame timing to allow - operation with older breezecom access - points. - -beacon_period integer beacon period in Kilo-microseconds - legal values = must be integer multiple - of hop dwell - default = 256 - -country integer 1 = USA (default) - 2 = Europe - 3 = Japan - 4 = Korea - 5 = Spain - 6 = France - 7 = Israel - 8 = Australia - -essid string ESS ID - network name to join - string with maximum length of 32 chars - default value = "ADHOC_ESSID" - -hop_dwell integer hop dwell time in Kilo-microseconds - legal values = 16,32,64,128(default),256 - -irq_mask integer linux standard 16 bit value 1bit/IRQ - lsb is IRQ 0, bit 1 is IRQ 1 etc. - Used to restrict choice of IRQ's to use. - Recommended method for controlling - interrupts is in /etc/pcmcia/config.opts - -net_type integer 0 (default) = adhoc network, - 1 = infrastructure - -phy_addr string string containing new MAC address in - hex, must start with x eg - x00008f123456 - -psm integer 0 = continuously active - 1 = power save mode (not useful yet) - -pc_debug integer (0-5) larger values for more verbose - logging. Replaces ray_debug. - -ray_debug integer Replaced with pc_debug - -ray_mem_speed integer defaults to 500 - -sniffer integer 0 = not sniffer (default) - 1 = sniffer which can be used to record all - network traffic using tcpdump or similar, - but no normal network use is allowed. - -translate integer 0 = no translation (encapsulate frames) - 1 = translation (RFC1042/802.1) - - -More on sniffer mode: - -tcpdump does not understand 802.11 headers, so it can't -interpret the contents, but it can record to a file. This is only -useful for debugging 802.11 lowlevel protocols that are not visible to -linux. If you want to watch ftp xfers, or do similar things, you -don't need to use sniffer mode. Also, some packet types are never -sent up by the card, so you will never see them (ack, rts, cts, probe -etc.) There is a simple program (showcap) included in the ray_cs -package which parses the 802.11 headers. - -Known Problems and missing features - - Does not work with non x86 - - Does not work with SMP - - Support for defragmenting frames is not yet debugged, and in - fact is known to not work. I have never encountered a net set - up to fragment, but still, it should be fixed. - - The ioctl support is incomplete. The hardware address cannot be set - using ifconfig yet. If a different hardware address is needed, it may - be set using the phy_addr parameter in ray_cs.opts. This requires - a card insertion to take effect. diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig index 1c98d781ae49..15b0ad171f4c 100644 --- a/drivers/net/wireless/Kconfig +++ b/drivers/net/wireless/Kconfig @@ -57,7 +57,7 @@ config PCMCIA_RAYCS ---help--- Say Y here if you intend to attach an Aviator/Raytheon PCMCIA (PC-card) wireless Ethernet networking card to your computer. - Please read the file for + Please read the file for details. To compile this driver as a module, choose M here: the module will be -- cgit v1.2.3 From bad5b6e223e8409c860c0574d5239ee4348f06b3 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:19 +0200 Subject: docs: networking: convert rds.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - mark tables as such; - mark lists as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Santosh Shilimkar Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/rds.rst | 448 +++++++++++++++++++++++++++++++++++++ Documentation/networking/rds.txt | 423 ---------------------------------- MAINTAINERS | 2 +- 4 files changed, 450 insertions(+), 424 deletions(-) create mode 100644 Documentation/networking/rds.rst delete mode 100644 Documentation/networking/rds.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b7e35b0d905c..e63a2cb2e4cb 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -97,6 +97,7 @@ Contents: proc_net_tcp radiotap-headers ray_cs + rds .. only:: subproject and html diff --git a/Documentation/networking/rds.rst b/Documentation/networking/rds.rst new file mode 100644 index 000000000000..44936c27ab3a --- /dev/null +++ b/Documentation/networking/rds.rst @@ -0,0 +1,448 @@ +.. SPDX-License-Identifier: GPL-2.0 + +== +RDS +=== + +Overview +======== + +This readme tries to provide some background on the hows and whys of RDS, +and will hopefully help you find your way around the code. + +In addition, please see this email about RDS origins: +http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html + +RDS Architecture +================ + +RDS provides reliable, ordered datagram delivery by using a single +reliable connection between any two nodes in the cluster. This allows +applications to use a single socket to talk to any other process in the +cluster - so in a cluster with N processes you need N sockets, in contrast +to N*N if you use a connection-oriented socket transport like TCP. + +RDS is not Infiniband-specific; it was designed to support different +transports. The current implementation used to support RDS over TCP as well +as IB. + +The high-level semantics of RDS from the application's point of view are + + * Addressing + + RDS uses IPv4 addresses and 16bit port numbers to identify + the end point of a connection. All socket operations that involve + passing addresses between kernel and user space generally + use a struct sockaddr_in. + + The fact that IPv4 addresses are used does not mean the underlying + transport has to be IP-based. In fact, RDS over IB uses a + reliable IB connection; the IP address is used exclusively to + locate the remote node's GID (by ARPing for the given IP). + + The port space is entirely independent of UDP, TCP or any other + protocol. + + * Socket interface + + RDS sockets work *mostly* as you would expect from a BSD + socket. The next section will cover the details. At any rate, + all I/O is performed through the standard BSD socket API. + Some additions like zerocopy support are implemented through + control messages, while other extensions use the getsockopt/ + setsockopt calls. + + Sockets must be bound before you can send or receive data. + This is needed because binding also selects a transport and + attaches it to the socket. Once bound, the transport assignment + does not change. RDS will tolerate IPs moving around (eg in + a active-active HA scenario), but only as long as the address + doesn't move to a different transport. + + * sysctls + + RDS supports a number of sysctls in /proc/sys/net/rds + + +Socket Interface +================ + + AF_RDS, PF_RDS, SOL_RDS + AF_RDS and PF_RDS are the domain type to be used with socket(2) + to create RDS sockets. SOL_RDS is the socket-level to be used + with setsockopt(2) and getsockopt(2) for RDS specific socket + options. + + fd = socket(PF_RDS, SOCK_SEQPACKET, 0); + This creates a new, unbound RDS socket. + + setsockopt(SOL_SOCKET): send and receive buffer size + RDS honors the send and receive buffer size socket options. + You are not allowed to queue more than SO_SNDSIZE bytes to + a socket. A message is queued when sendmsg is called, and + it leaves the queue when the remote system acknowledges + its arrival. + + The SO_RCVSIZE option controls the maximum receive queue length. + This is a soft limit rather than a hard limit - RDS will + continue to accept and queue incoming messages, even if that + takes the queue length over the limit. However, it will also + mark the port as "congested" and send a congestion update to + the source node. The source node is supposed to throttle any + processes sending to this congested port. + + bind(fd, &sockaddr_in, ...) + This binds the socket to a local IP address and port, and a + transport, if one has not already been selected via the + SO_RDS_TRANSPORT socket option + + sendmsg(fd, ...) + Sends a message to the indicated recipient. The kernel will + transparently establish the underlying reliable connection + if it isn't up yet. + + An attempt to send a message that exceeds SO_SNDSIZE will + return with -EMSGSIZE + + An attempt to send a message that would take the total number + of queued bytes over the SO_SNDSIZE threshold will return + EAGAIN. + + An attempt to send a message to a destination that is marked + as "congested" will return ENOBUFS. + + recvmsg(fd, ...) + Receives a message that was queued to this socket. The sockets + recv queue accounting is adjusted, and if the queue length + drops below SO_SNDSIZE, the port is marked uncongested, and + a congestion update is sent to all peers. + + Applications can ask the RDS kernel module to receive + notifications via control messages (for instance, there is a + notification when a congestion update arrived, or when a RDMA + operation completes). These notifications are received through + the msg.msg_control buffer of struct msghdr. The format of the + messages is described in manpages. + + poll(fd) + RDS supports the poll interface to allow the application + to implement async I/O. + + POLLIN handling is pretty straightforward. When there's an + incoming message queued to the socket, or a pending notification, + we signal POLLIN. + + POLLOUT is a little harder. Since you can essentially send + to any destination, RDS will always signal POLLOUT as long as + there's room on the send queue (ie the number of bytes queued + is less than the sendbuf size). + + However, the kernel will refuse to accept messages to + a destination marked congested - in this case you will loop + forever if you rely on poll to tell you what to do. + This isn't a trivial problem, but applications can deal with + this - by using congestion notifications, and by checking for + ENOBUFS errors returned by sendmsg. + + setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) + This allows the application to discard all messages queued to a + specific destination on this particular socket. + + This allows the application to cancel outstanding messages if + it detects a timeout. For instance, if it tried to send a message, + and the remote host is unreachable, RDS will keep trying forever. + The application may decide it's not worth it, and cancel the + operation. In this case, it would use RDS_CANCEL_SENT_TO to + nuke any pending messages. + + ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)`` + Set or read an integer defining the underlying + encapsulating transport to be used for RDS packets on the + socket. When setting the option, integer argument may be + one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the + value, RDS_TRANS_NONE will be returned on an unbound socket. + This socket option may only be set exactly once on the socket, + prior to binding it via the bind(2) system call. Attempts to + set SO_RDS_TRANSPORT on a socket for which the transport has + been previously attached explicitly (by SO_RDS_TRANSPORT) or + implicitly (via bind(2)) will return an error of EOPNOTSUPP. + An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will + always return EINVAL. + +RDMA for RDS +============ + + see rds-rdma(7) manpage (available in rds-tools) + + +Congestion Notifications +======================== + + see rds(7) manpage + + +RDS Protocol +============ + + Message header + + The message header is a 'struct rds_header' (see rds.h): + + Fields: + + h_sequence: + per-packet sequence number + h_ack: + piggybacked acknowledgment of last packet received + h_len: + length of data, not including header + h_sport: + source port + h_dport: + destination port + h_flags: + Can be: + + ============= ================================== + CONG_BITMAP this is a congestion update bitmap + ACK_REQUIRED receiver must ack this packet + RETRANSMITTED packet has previously been sent + ============= ================================== + + h_credit: + indicate to other end of connection that + it has more credits available (i.e. there is + more send room) + h_padding[4]: + unused, for future use + h_csum: + header checksum + h_exthdr: + optional data can be passed here. This is currently used for + passing RDMA-related information. + + ACK and retransmit handling + + One might think that with reliable IB connections you wouldn't need + to ack messages that have been received. The problem is that IB + hardware generates an ack message before it has DMAed the message + into memory. This creates a potential message loss if the HCA is + disabled for any reason between when it sends the ack and before + the message is DMAed and processed. This is only a potential issue + if another HCA is available for fail-over. + + Sending an ack immediately would allow the sender to free the sent + message from their send queue quickly, but could cause excessive + traffic to be used for acks. RDS piggybacks acks on sent data + packets. Ack-only packets are reduced by only allowing one to be + in flight at a time, and by the sender only asking for acks when + its send buffers start to fill up. All retransmissions are also + acked. + + Flow Control + + RDS's IB transport uses a credit-based mechanism to verify that + there is space in the peer's receive buffers for more data. This + eliminates the need for hardware retries on the connection. + + Congestion + + Messages waiting in the receive queue on the receiving socket + are accounted against the sockets SO_RCVBUF option value. Only + the payload bytes in the message are accounted for. If the + number of bytes queued equals or exceeds rcvbuf then the socket + is congested. All sends attempted to this socket's address + should return block or return -EWOULDBLOCK. + + Applications are expected to be reasonably tuned such that this + situation very rarely occurs. An application encountering this + "back-pressure" is considered a bug. + + This is implemented by having each node maintain bitmaps which + indicate which ports on bound addresses are congested. As the + bitmap changes it is sent through all the connections which + terminate in the local address of the bitmap which changed. + + The bitmaps are allocated as connections are brought up. This + avoids allocation in the interrupt handling path which queues + sages on sockets. The dense bitmaps let transports send the + entire bitmap on any bitmap change reasonably efficiently. This + is much easier to implement than some finer-grained + communication of per-port congestion. The sender does a very + inexpensive bit test to test if the port it's about to send to + is congested or not. + + +RDS Transport Layer +=================== + + As mentioned above, RDS is not IB-specific. Its code is divided + into a general RDS layer and a transport layer. + + The general layer handles the socket API, congestion handling, + loopback, stats, usermem pinning, and the connection state machine. + + The transport layer handles the details of the transport. The IB + transport, for example, handles all the queue pairs, work requests, + CM event handlers, and other Infiniband details. + + +RDS Kernel Structures +===================== + + struct rds_message + aka possibly "rds_outgoing", the generic RDS layer copies data to + be sent and sets header fields as needed, based on the socket API. + This is then queued for the individual connection and sent by the + connection's transport. + + struct rds_incoming + a generic struct referring to incoming data that can be handed from + the transport to the general code and queued by the general code + while the socket is awoken. It is then passed back to the transport + code to handle the actual copy-to-user. + + struct rds_socket + per-socket information + + struct rds_connection + per-connection information + + struct rds_transport + pointers to transport-specific functions + + struct rds_statistics + non-transport-specific statistics + + struct rds_cong_map + wraps the raw congestion bitmap, contains rbnode, waitq, etc. + +Connection management +===================== + + Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and + ERROR states. + + The first time an attempt is made by an RDS socket to send data to + a node, a connection is allocated and connected. That connection is + then maintained forever -- if there are transport errors, the + connection will be dropped and re-established. + + Dropping a connection while packets are queued will cause queued or + partially-sent datagrams to be retransmitted when the connection is + re-established. + + +The send path +============= + + rds_sendmsg() + - struct rds_message built from incoming data + - CMSGs parsed (e.g. RDMA ops) + - transport connection alloced and connected if not already + - rds_message placed on send queue + - send worker awoken + + rds_send_worker() + - calls rds_send_xmit() until queue is empty + + rds_send_xmit() + - transmits congestion map if one is pending + - may set ACK_REQUIRED + - calls transport to send either non-RDMA or RDMA message + (RDMA ops never retransmitted) + + rds_ib_xmit() + - allocs work requests from send ring + - adds any new send credits available to peer (h_credits) + - maps the rds_message's sg list + - piggybacks ack + - populates work requests + - post send to connection's queue pair + +The recv path +============= + + rds_ib_recv_cq_comp_handler() + - looks at write completions + - unmaps recv buffer from device + - no errors, call rds_ib_process_recv() + - refill recv ring + + rds_ib_process_recv() + - validate header checksum + - copy header to rds_ib_incoming struct if start of a new datagram + - add to ibinc's fraglist + - if competed datagram: + - update cong map if datagram was cong update + - call rds_recv_incoming() otherwise + - note if ack is required + + rds_recv_incoming() + - drop duplicate packets + - respond to pings + - find the sock associated with this datagram + - add to sock queue + - wake up sock + - do some congestion calculations + rds_recvmsg + - copy data into user iovec + - handle CMSGs + - return to application + +Multipath RDS (mprds) +===================== + Mprds is multipathed-RDS, primarily intended for RDS-over-TCP + (though the concept can be extended to other transports). The classical + implementation of RDS-over-TCP is implemented by demultiplexing multiple + PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, + port]) over a single TCP socket between the 2 IP addresses involved. This + has the limitation that it ends up funneling multiple RDS flows over a + single TCP flow, thus it is + (a) upper-bounded to the single-flow bandwidth, + (b) suffers from head-of-line blocking for all the RDS sockets. + + Better throughput (for a fixed small packet size, MTU) can be achieved + by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed + RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp + connection. RDS sockets will be attached to a path based on some hash + (e.g., of local address and RDS port number) and packets for that RDS + socket will be sent over the attached path using TCP to segment/reassemble + RDS datagrams on that path. + + Multipathed RDS is implemented by splitting the struct rds_connection into + a common (to all paths) part, and a per-path struct rds_conn_path. All + I/O workqs and reconnect threads are driven from the rds_conn_path. + Transports such as TCP that are multipath capable may then set up a + TCP socket per rds_conn_path, and this is managed by the transport via + the transport privatee cp_transport_data pointer. + + Transports announce themselves as multipath capable by setting the + t_mp_capable bit during registration with the rds core module. When the + transport is multipath-capable, rds_sendmsg() hashes outgoing traffic + across multiple paths. The outgoing hash is computed based on the + local address and port that the PF_RDS socket is bound to. + + Additionally, even if the transport is MP capable, we may be + peering with some node that does not support mprds, or supports + a different number of paths. As a result, the peering nodes need + to agree on the number of paths to be used for the connection. + This is done by sending out a control packet exchange before the + first data packet. The control packet exchange must have completed + prior to outgoing hash completion in rds_sendmsg() when the transport + is mutlipath capable. + + The control packet is an RDS ping packet (i.e., packet to rds dest + port 0) with the ping packet having a rds extension header option of + type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the + number of paths supported by the sender. The "probe" ping packet will + get sent from some reserved port, RDS_FLAG_PROBE_PORT (in ) + The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately + be able to compute the min(sender_paths, rcvr_paths). The pong + sent in response to a probe-ping should contain the rcvr's npaths + when the rcvr is mprds-capable. + + If the rcvr is not mprds-capable, the exthdr in the ping will be + ignored. In this case the pong will not have any exthdrs, so the sender + of the probe-ping can default to single-path mprds. + diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt deleted file mode 100644 index eec61694e894..000000000000 --- a/Documentation/networking/rds.txt +++ /dev/null @@ -1,423 +0,0 @@ - -Overview -======== - -This readme tries to provide some background on the hows and whys of RDS, -and will hopefully help you find your way around the code. - -In addition, please see this email about RDS origins: -http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html - -RDS Architecture -================ - -RDS provides reliable, ordered datagram delivery by using a single -reliable connection between any two nodes in the cluster. This allows -applications to use a single socket to talk to any other process in the -cluster - so in a cluster with N processes you need N sockets, in contrast -to N*N if you use a connection-oriented socket transport like TCP. - -RDS is not Infiniband-specific; it was designed to support different -transports. The current implementation used to support RDS over TCP as well -as IB. - -The high-level semantics of RDS from the application's point of view are - - * Addressing - RDS uses IPv4 addresses and 16bit port numbers to identify - the end point of a connection. All socket operations that involve - passing addresses between kernel and user space generally - use a struct sockaddr_in. - - The fact that IPv4 addresses are used does not mean the underlying - transport has to be IP-based. In fact, RDS over IB uses a - reliable IB connection; the IP address is used exclusively to - locate the remote node's GID (by ARPing for the given IP). - - The port space is entirely independent of UDP, TCP or any other - protocol. - - * Socket interface - RDS sockets work *mostly* as you would expect from a BSD - socket. The next section will cover the details. At any rate, - all I/O is performed through the standard BSD socket API. - Some additions like zerocopy support are implemented through - control messages, while other extensions use the getsockopt/ - setsockopt calls. - - Sockets must be bound before you can send or receive data. - This is needed because binding also selects a transport and - attaches it to the socket. Once bound, the transport assignment - does not change. RDS will tolerate IPs moving around (eg in - a active-active HA scenario), but only as long as the address - doesn't move to a different transport. - - * sysctls - RDS supports a number of sysctls in /proc/sys/net/rds - - -Socket Interface -================ - - AF_RDS, PF_RDS, SOL_RDS - AF_RDS and PF_RDS are the domain type to be used with socket(2) - to create RDS sockets. SOL_RDS is the socket-level to be used - with setsockopt(2) and getsockopt(2) for RDS specific socket - options. - - fd = socket(PF_RDS, SOCK_SEQPACKET, 0); - This creates a new, unbound RDS socket. - - setsockopt(SOL_SOCKET): send and receive buffer size - RDS honors the send and receive buffer size socket options. - You are not allowed to queue more than SO_SNDSIZE bytes to - a socket. A message is queued when sendmsg is called, and - it leaves the queue when the remote system acknowledges - its arrival. - - The SO_RCVSIZE option controls the maximum receive queue length. - This is a soft limit rather than a hard limit - RDS will - continue to accept and queue incoming messages, even if that - takes the queue length over the limit. However, it will also - mark the port as "congested" and send a congestion update to - the source node. The source node is supposed to throttle any - processes sending to this congested port. - - bind(fd, &sockaddr_in, ...) - This binds the socket to a local IP address and port, and a - transport, if one has not already been selected via the - SO_RDS_TRANSPORT socket option - - sendmsg(fd, ...) - Sends a message to the indicated recipient. The kernel will - transparently establish the underlying reliable connection - if it isn't up yet. - - An attempt to send a message that exceeds SO_SNDSIZE will - return with -EMSGSIZE - - An attempt to send a message that would take the total number - of queued bytes over the SO_SNDSIZE threshold will return - EAGAIN. - - An attempt to send a message to a destination that is marked - as "congested" will return ENOBUFS. - - recvmsg(fd, ...) - Receives a message that was queued to this socket. The sockets - recv queue accounting is adjusted, and if the queue length - drops below SO_SNDSIZE, the port is marked uncongested, and - a congestion update is sent to all peers. - - Applications can ask the RDS kernel module to receive - notifications via control messages (for instance, there is a - notification when a congestion update arrived, or when a RDMA - operation completes). These notifications are received through - the msg.msg_control buffer of struct msghdr. The format of the - messages is described in manpages. - - poll(fd) - RDS supports the poll interface to allow the application - to implement async I/O. - - POLLIN handling is pretty straightforward. When there's an - incoming message queued to the socket, or a pending notification, - we signal POLLIN. - - POLLOUT is a little harder. Since you can essentially send - to any destination, RDS will always signal POLLOUT as long as - there's room on the send queue (ie the number of bytes queued - is less than the sendbuf size). - - However, the kernel will refuse to accept messages to - a destination marked congested - in this case you will loop - forever if you rely on poll to tell you what to do. - This isn't a trivial problem, but applications can deal with - this - by using congestion notifications, and by checking for - ENOBUFS errors returned by sendmsg. - - setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) - This allows the application to discard all messages queued to a - specific destination on this particular socket. - - This allows the application to cancel outstanding messages if - it detects a timeout. For instance, if it tried to send a message, - and the remote host is unreachable, RDS will keep trying forever. - The application may decide it's not worth it, and cancel the - operation. In this case, it would use RDS_CANCEL_SENT_TO to - nuke any pending messages. - - setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..) - getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..) - Set or read an integer defining the underlying - encapsulating transport to be used for RDS packets on the - socket. When setting the option, integer argument may be - one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the - value, RDS_TRANS_NONE will be returned on an unbound socket. - This socket option may only be set exactly once on the socket, - prior to binding it via the bind(2) system call. Attempts to - set SO_RDS_TRANSPORT on a socket for which the transport has - been previously attached explicitly (by SO_RDS_TRANSPORT) or - implicitly (via bind(2)) will return an error of EOPNOTSUPP. - An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will - always return EINVAL. - -RDMA for RDS -============ - - see rds-rdma(7) manpage (available in rds-tools) - - -Congestion Notifications -======================== - - see rds(7) manpage - - -RDS Protocol -============ - - Message header - - The message header is a 'struct rds_header' (see rds.h): - Fields: - h_sequence: - per-packet sequence number - h_ack: - piggybacked acknowledgment of last packet received - h_len: - length of data, not including header - h_sport: - source port - h_dport: - destination port - h_flags: - CONG_BITMAP - this is a congestion update bitmap - ACK_REQUIRED - receiver must ack this packet - RETRANSMITTED - packet has previously been sent - h_credit: - indicate to other end of connection that - it has more credits available (i.e. there is - more send room) - h_padding[4]: - unused, for future use - h_csum: - header checksum - h_exthdr: - optional data can be passed here. This is currently used for - passing RDMA-related information. - - ACK and retransmit handling - - One might think that with reliable IB connections you wouldn't need - to ack messages that have been received. The problem is that IB - hardware generates an ack message before it has DMAed the message - into memory. This creates a potential message loss if the HCA is - disabled for any reason between when it sends the ack and before - the message is DMAed and processed. This is only a potential issue - if another HCA is available for fail-over. - - Sending an ack immediately would allow the sender to free the sent - message from their send queue quickly, but could cause excessive - traffic to be used for acks. RDS piggybacks acks on sent data - packets. Ack-only packets are reduced by only allowing one to be - in flight at a time, and by the sender only asking for acks when - its send buffers start to fill up. All retransmissions are also - acked. - - Flow Control - - RDS's IB transport uses a credit-based mechanism to verify that - there is space in the peer's receive buffers for more data. This - eliminates the need for hardware retries on the connection. - - Congestion - - Messages waiting in the receive queue on the receiving socket - are accounted against the sockets SO_RCVBUF option value. Only - the payload bytes in the message are accounted for. If the - number of bytes queued equals or exceeds rcvbuf then the socket - is congested. All sends attempted to this socket's address - should return block or return -EWOULDBLOCK. - - Applications are expected to be reasonably tuned such that this - situation very rarely occurs. An application encountering this - "back-pressure" is considered a bug. - - This is implemented by having each node maintain bitmaps which - indicate which ports on bound addresses are congested. As the - bitmap changes it is sent through all the connections which - terminate in the local address of the bitmap which changed. - - The bitmaps are allocated as connections are brought up. This - avoids allocation in the interrupt handling path which queues - sages on sockets. The dense bitmaps let transports send the - entire bitmap on any bitmap change reasonably efficiently. This - is much easier to implement than some finer-grained - communication of per-port congestion. The sender does a very - inexpensive bit test to test if the port it's about to send to - is congested or not. - - -RDS Transport Layer -================== - - As mentioned above, RDS is not IB-specific. Its code is divided - into a general RDS layer and a transport layer. - - The general layer handles the socket API, congestion handling, - loopback, stats, usermem pinning, and the connection state machine. - - The transport layer handles the details of the transport. The IB - transport, for example, handles all the queue pairs, work requests, - CM event handlers, and other Infiniband details. - - -RDS Kernel Structures -===================== - - struct rds_message - aka possibly "rds_outgoing", the generic RDS layer copies data to - be sent and sets header fields as needed, based on the socket API. - This is then queued for the individual connection and sent by the - connection's transport. - struct rds_incoming - a generic struct referring to incoming data that can be handed from - the transport to the general code and queued by the general code - while the socket is awoken. It is then passed back to the transport - code to handle the actual copy-to-user. - struct rds_socket - per-socket information - struct rds_connection - per-connection information - struct rds_transport - pointers to transport-specific functions - struct rds_statistics - non-transport-specific statistics - struct rds_cong_map - wraps the raw congestion bitmap, contains rbnode, waitq, etc. - -Connection management -===================== - - Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and - ERROR states. - - The first time an attempt is made by an RDS socket to send data to - a node, a connection is allocated and connected. That connection is - then maintained forever -- if there are transport errors, the - connection will be dropped and re-established. - - Dropping a connection while packets are queued will cause queued or - partially-sent datagrams to be retransmitted when the connection is - re-established. - - -The send path -============= - - rds_sendmsg() - struct rds_message built from incoming data - CMSGs parsed (e.g. RDMA ops) - transport connection alloced and connected if not already - rds_message placed on send queue - send worker awoken - rds_send_worker() - calls rds_send_xmit() until queue is empty - rds_send_xmit() - transmits congestion map if one is pending - may set ACK_REQUIRED - calls transport to send either non-RDMA or RDMA message - (RDMA ops never retransmitted) - rds_ib_xmit() - allocs work requests from send ring - adds any new send credits available to peer (h_credits) - maps the rds_message's sg list - piggybacks ack - populates work requests - post send to connection's queue pair - -The recv path -============= - - rds_ib_recv_cq_comp_handler() - looks at write completions - unmaps recv buffer from device - no errors, call rds_ib_process_recv() - refill recv ring - rds_ib_process_recv() - validate header checksum - copy header to rds_ib_incoming struct if start of a new datagram - add to ibinc's fraglist - if competed datagram: - update cong map if datagram was cong update - call rds_recv_incoming() otherwise - note if ack is required - rds_recv_incoming() - drop duplicate packets - respond to pings - find the sock associated with this datagram - add to sock queue - wake up sock - do some congestion calculations - rds_recvmsg - copy data into user iovec - handle CMSGs - return to application - -Multipath RDS (mprds) -===================== - Mprds is multipathed-RDS, primarily intended for RDS-over-TCP - (though the concept can be extended to other transports). The classical - implementation of RDS-over-TCP is implemented by demultiplexing multiple - PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, - port]) over a single TCP socket between the 2 IP addresses involved. This - has the limitation that it ends up funneling multiple RDS flows over a - single TCP flow, thus it is - (a) upper-bounded to the single-flow bandwidth, - (b) suffers from head-of-line blocking for all the RDS sockets. - - Better throughput (for a fixed small packet size, MTU) can be achieved - by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed - RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp - connection. RDS sockets will be attached to a path based on some hash - (e.g., of local address and RDS port number) and packets for that RDS - socket will be sent over the attached path using TCP to segment/reassemble - RDS datagrams on that path. - - Multipathed RDS is implemented by splitting the struct rds_connection into - a common (to all paths) part, and a per-path struct rds_conn_path. All - I/O workqs and reconnect threads are driven from the rds_conn_path. - Transports such as TCP that are multipath capable may then set up a - TCP socket per rds_conn_path, and this is managed by the transport via - the transport privatee cp_transport_data pointer. - - Transports announce themselves as multipath capable by setting the - t_mp_capable bit during registration with the rds core module. When the - transport is multipath-capable, rds_sendmsg() hashes outgoing traffic - across multiple paths. The outgoing hash is computed based on the - local address and port that the PF_RDS socket is bound to. - - Additionally, even if the transport is MP capable, we may be - peering with some node that does not support mprds, or supports - a different number of paths. As a result, the peering nodes need - to agree on the number of paths to be used for the connection. - This is done by sending out a control packet exchange before the - first data packet. The control packet exchange must have completed - prior to outgoing hash completion in rds_sendmsg() when the transport - is mutlipath capable. - - The control packet is an RDS ping packet (i.e., packet to rds dest - port 0) with the ping packet having a rds extension header option of - type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the - number of paths supported by the sender. The "probe" ping packet will - get sent from some reserved port, RDS_FLAG_PROBE_PORT (in ) - The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately - be able to compute the min(sender_paths, rcvr_paths). The pong - sent in response to a probe-ping should contain the rcvr's npaths - when the rcvr is mprds-capable. - - If the rcvr is not mprds-capable, the exthdr in the ping will be - ignored. In this case the pong will not have any exthdrs, so the sender - of the probe-ping can default to single-path mprds. - diff --git a/MAINTAINERS b/MAINTAINERS index 785f56e5f210..ea5dd3d1df9d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14219,7 +14219,7 @@ L: linux-rdma@vger.kernel.org L: rds-devel@oss.oracle.com (moderated for non-subscribers) S: Supported W: https://oss.oracle.com/projects/rds/ -F: Documentation/networking/rds.txt +F: Documentation/networking/rds.rst F: net/rds/ RDT - RESOURCE ALLOCATION -- cgit v1.2.3 From 98661e0c579dbda0e0910185f752fddd95e2d29c Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:20 +0200 Subject: docs: networking: convert regulatory.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/regulatory.rst | 209 ++++++++++++++++++++++++++++++++ Documentation/networking/regulatory.txt | 204 ------------------------------- MAINTAINERS | 2 +- 4 files changed, 211 insertions(+), 205 deletions(-) create mode 100644 Documentation/networking/regulatory.rst delete mode 100644 Documentation/networking/regulatory.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e63a2cb2e4cb..bc3b04a2edde 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -98,6 +98,7 @@ Contents: radiotap-headers ray_cs rds + regulatory .. only:: subproject and html diff --git a/Documentation/networking/regulatory.rst b/Documentation/networking/regulatory.rst new file mode 100644 index 000000000000..8701b91e81ee --- /dev/null +++ b/Documentation/networking/regulatory.rst @@ -0,0 +1,209 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================= +Linux wireless regulatory documentation +======================================= + +This document gives a brief review over how the Linux wireless +regulatory infrastructure works. + +More up to date information can be obtained at the project's web page: + +http://wireless.kernel.org/en/developers/Regulatory + +Keeping regulatory domains in userspace +--------------------------------------- + +Due to the dynamic nature of regulatory domains we keep them +in userspace and provide a framework for userspace to upload +to the kernel one regulatory domain to be used as the central +core regulatory domain all wireless devices should adhere to. + +How to get regulatory domains to the kernel +------------------------------------------- + +When the regulatory domain is first set up, the kernel will request a +database file (regulatory.db) containing all the regulatory rules. It +will then use that database when it needs to look up the rules for a +given country. + +How to get regulatory domains to the kernel (old CRDA solution) +--------------------------------------------------------------- + +Userspace gets a regulatory domain in the kernel by having +a userspace agent build it and send it via nl80211. Only +expected regulatory domains will be respected by the kernel. + +A currently available userspace agent which can accomplish this +is CRDA - central regulatory domain agent. Its documented here: + +http://wireless.kernel.org/en/developers/Regulatory/CRDA + +Essentially the kernel will send a udev event when it knows +it needs a new regulatory domain. A udev rule can be put in place +to trigger crda to send the respective regulatory domain for a +specific ISO/IEC 3166 alpha2. + +Below is an example udev rule which can be used: + +# Example file, should be put in /etc/udev/rules.d/regulatory.rules +KERNEL=="regulatory*", ACTION=="change", SUBSYSTEM=="platform", RUN+="/sbin/crda" + +The alpha2 is passed as an environment variable under the variable COUNTRY. + +Who asks for regulatory domains? +-------------------------------- + +* Users + +Users can use iw: + +http://wireless.kernel.org/en/users/Documentation/iw + +An example:: + + # set regulatory domain to "Costa Rica" + iw reg set CR + +This will request the kernel to set the regulatory domain to +the specificied alpha2. The kernel in turn will then ask userspace +to provide a regulatory domain for the alpha2 specified by the user +by sending a uevent. + +* Wireless subsystems for Country Information elements + +The kernel will send a uevent to inform userspace a new +regulatory domain is required. More on this to be added +as its integration is added. + +* Drivers + +If drivers determine they need a specific regulatory domain +set they can inform the wireless core using regulatory_hint(). +They have two options -- they either provide an alpha2 so that +crda can provide back a regulatory domain for that country or +they can build their own regulatory domain based on internal +custom knowledge so the wireless core can respect it. + +*Most* drivers will rely on the first mechanism of providing a +regulatory hint with an alpha2. For these drivers there is an additional +check that can be used to ensure compliance based on custom EEPROM +regulatory data. This additional check can be used by drivers by +registering on its struct wiphy a reg_notifier() callback. This notifier +is called when the core's regulatory domain has been changed. The driver +can use this to review the changes made and also review who made them +(driver, user, country IE) and determine what to allow based on its +internal EEPROM data. Devices drivers wishing to be capable of world +roaming should use this callback. More on world roaming will be +added to this document when its support is enabled. + +Device drivers who provide their own built regulatory domain +do not need a callback as the channels registered by them are +the only ones that will be allowed and therefore *additional* +channels cannot be enabled. + +Example code - drivers hinting an alpha2: +------------------------------------------ + +This example comes from the zd1211rw device driver. You can start +by having a mapping of your device's EEPROM country/regulatory +domain value to a specific alpha2 as follows:: + + static struct zd_reg_alpha2_map reg_alpha2_map[] = { + { ZD_REGDOMAIN_FCC, "US" }, + { ZD_REGDOMAIN_IC, "CA" }, + { ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */ + { ZD_REGDOMAIN_JAPAN, "JP" }, + { ZD_REGDOMAIN_JAPAN_ADD, "JP" }, + { ZD_REGDOMAIN_SPAIN, "ES" }, + { ZD_REGDOMAIN_FRANCE, "FR" }, + +Then you can define a routine to map your read EEPROM value to an alpha2, +as follows:: + + static int zd_reg2alpha2(u8 regdomain, char *alpha2) + { + unsigned int i; + struct zd_reg_alpha2_map *reg_map; + for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) { + reg_map = ®_alpha2_map[i]; + if (regdomain == reg_map->reg) { + alpha2[0] = reg_map->alpha2[0]; + alpha2[1] = reg_map->alpha2[1]; + return 0; + } + } + return 1; + } + +Lastly, you can then hint to the core of your discovered alpha2, if a match +was found. You need to do this after you have registered your wiphy. You +are expected to do this during initialization. + +:: + + r = zd_reg2alpha2(mac->regdomain, alpha2); + if (!r) + regulatory_hint(hw->wiphy, alpha2); + +Example code - drivers providing a built in regulatory domain: +-------------------------------------------------------------- + +[NOTE: This API is not currently available, it can be added when required] + +If you have regulatory information you can obtain from your +driver and you *need* to use this we let you build a regulatory domain +structure and pass it to the wireless core. To do this you should +kmalloc() a structure big enough to hold your regulatory domain +structure and you should then fill it with your data. Finally you simply +call regulatory_hint() with the regulatory domain structure in it. + +Bellow is a simple example, with a regulatory domain cached using the stack. +Your implementation may vary (read EEPROM cache instead, for example). + +Example cache of some regulatory domain:: + + struct ieee80211_regdomain mydriver_jp_regdom = { + .n_reg_rules = 3, + .alpha2 = "JP", + //.alpha2 = "99", /* If I have no alpha2 to map it to */ + .reg_rules = { + /* IEEE 802.11b/g, channels 1..14 */ + REG_RULE(2412-10, 2484+10, 40, 6, 20, 0), + /* IEEE 802.11a, channels 34..48 */ + REG_RULE(5170-10, 5240+10, 40, 6, 20, + NL80211_RRF_NO_IR), + /* IEEE 802.11a, channels 52..64 */ + REG_RULE(5260-10, 5320+10, 40, 6, 20, + NL80211_RRF_NO_IR| + NL80211_RRF_DFS), + } + }; + +Then in some part of your code after your wiphy has been registered:: + + struct ieee80211_regdomain *rd; + int size_of_regd; + int num_rules = mydriver_jp_regdom.n_reg_rules; + unsigned int i; + + size_of_regd = sizeof(struct ieee80211_regdomain) + + (num_rules * sizeof(struct ieee80211_reg_rule)); + + rd = kzalloc(size_of_regd, GFP_KERNEL); + if (!rd) + return -ENOMEM; + + memcpy(rd, &mydriver_jp_regdom, sizeof(struct ieee80211_regdomain)); + + for (i=0; i < num_rules; i++) + memcpy(&rd->reg_rules[i], + &mydriver_jp_regdom.reg_rules[i], + sizeof(struct ieee80211_reg_rule)); + regulatory_struct_hint(rd); + +Statically compiled regulatory database +--------------------------------------- + +When a database should be fixed into the kernel, it can be provided as a +firmware file at build time that is then linked into the kernel. diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.txt deleted file mode 100644 index 381e5b23d61d..000000000000 --- a/Documentation/networking/regulatory.txt +++ /dev/null @@ -1,204 +0,0 @@ -Linux wireless regulatory documentation ---------------------------------------- - -This document gives a brief review over how the Linux wireless -regulatory infrastructure works. - -More up to date information can be obtained at the project's web page: - -http://wireless.kernel.org/en/developers/Regulatory - -Keeping regulatory domains in userspace ---------------------------------------- - -Due to the dynamic nature of regulatory domains we keep them -in userspace and provide a framework for userspace to upload -to the kernel one regulatory domain to be used as the central -core regulatory domain all wireless devices should adhere to. - -How to get regulatory domains to the kernel -------------------------------------------- - -When the regulatory domain is first set up, the kernel will request a -database file (regulatory.db) containing all the regulatory rules. It -will then use that database when it needs to look up the rules for a -given country. - -How to get regulatory domains to the kernel (old CRDA solution) ---------------------------------------------------------------- - -Userspace gets a regulatory domain in the kernel by having -a userspace agent build it and send it via nl80211. Only -expected regulatory domains will be respected by the kernel. - -A currently available userspace agent which can accomplish this -is CRDA - central regulatory domain agent. Its documented here: - -http://wireless.kernel.org/en/developers/Regulatory/CRDA - -Essentially the kernel will send a udev event when it knows -it needs a new regulatory domain. A udev rule can be put in place -to trigger crda to send the respective regulatory domain for a -specific ISO/IEC 3166 alpha2. - -Below is an example udev rule which can be used: - -# Example file, should be put in /etc/udev/rules.d/regulatory.rules -KERNEL=="regulatory*", ACTION=="change", SUBSYSTEM=="platform", RUN+="/sbin/crda" - -The alpha2 is passed as an environment variable under the variable COUNTRY. - -Who asks for regulatory domains? --------------------------------- - -* Users - -Users can use iw: - -http://wireless.kernel.org/en/users/Documentation/iw - -An example: - - # set regulatory domain to "Costa Rica" - iw reg set CR - -This will request the kernel to set the regulatory domain to -the specificied alpha2. The kernel in turn will then ask userspace -to provide a regulatory domain for the alpha2 specified by the user -by sending a uevent. - -* Wireless subsystems for Country Information elements - -The kernel will send a uevent to inform userspace a new -regulatory domain is required. More on this to be added -as its integration is added. - -* Drivers - -If drivers determine they need a specific regulatory domain -set they can inform the wireless core using regulatory_hint(). -They have two options -- they either provide an alpha2 so that -crda can provide back a regulatory domain for that country or -they can build their own regulatory domain based on internal -custom knowledge so the wireless core can respect it. - -*Most* drivers will rely on the first mechanism of providing a -regulatory hint with an alpha2. For these drivers there is an additional -check that can be used to ensure compliance based on custom EEPROM -regulatory data. This additional check can be used by drivers by -registering on its struct wiphy a reg_notifier() callback. This notifier -is called when the core's regulatory domain has been changed. The driver -can use this to review the changes made and also review who made them -(driver, user, country IE) and determine what to allow based on its -internal EEPROM data. Devices drivers wishing to be capable of world -roaming should use this callback. More on world roaming will be -added to this document when its support is enabled. - -Device drivers who provide their own built regulatory domain -do not need a callback as the channels registered by them are -the only ones that will be allowed and therefore *additional* -channels cannot be enabled. - -Example code - drivers hinting an alpha2: ------------------------------------------- - -This example comes from the zd1211rw device driver. You can start -by having a mapping of your device's EEPROM country/regulatory -domain value to a specific alpha2 as follows: - -static struct zd_reg_alpha2_map reg_alpha2_map[] = { - { ZD_REGDOMAIN_FCC, "US" }, - { ZD_REGDOMAIN_IC, "CA" }, - { ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */ - { ZD_REGDOMAIN_JAPAN, "JP" }, - { ZD_REGDOMAIN_JAPAN_ADD, "JP" }, - { ZD_REGDOMAIN_SPAIN, "ES" }, - { ZD_REGDOMAIN_FRANCE, "FR" }, - -Then you can define a routine to map your read EEPROM value to an alpha2, -as follows: - -static int zd_reg2alpha2(u8 regdomain, char *alpha2) -{ - unsigned int i; - struct zd_reg_alpha2_map *reg_map; - for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) { - reg_map = ®_alpha2_map[i]; - if (regdomain == reg_map->reg) { - alpha2[0] = reg_map->alpha2[0]; - alpha2[1] = reg_map->alpha2[1]; - return 0; - } - } - return 1; -} - -Lastly, you can then hint to the core of your discovered alpha2, if a match -was found. You need to do this after you have registered your wiphy. You -are expected to do this during initialization. - - r = zd_reg2alpha2(mac->regdomain, alpha2); - if (!r) - regulatory_hint(hw->wiphy, alpha2); - -Example code - drivers providing a built in regulatory domain: --------------------------------------------------------------- - -[NOTE: This API is not currently available, it can be added when required] - -If you have regulatory information you can obtain from your -driver and you *need* to use this we let you build a regulatory domain -structure and pass it to the wireless core. To do this you should -kmalloc() a structure big enough to hold your regulatory domain -structure and you should then fill it with your data. Finally you simply -call regulatory_hint() with the regulatory domain structure in it. - -Bellow is a simple example, with a regulatory domain cached using the stack. -Your implementation may vary (read EEPROM cache instead, for example). - -Example cache of some regulatory domain - -struct ieee80211_regdomain mydriver_jp_regdom = { - .n_reg_rules = 3, - .alpha2 = "JP", - //.alpha2 = "99", /* If I have no alpha2 to map it to */ - .reg_rules = { - /* IEEE 802.11b/g, channels 1..14 */ - REG_RULE(2412-10, 2484+10, 40, 6, 20, 0), - /* IEEE 802.11a, channels 34..48 */ - REG_RULE(5170-10, 5240+10, 40, 6, 20, - NL80211_RRF_NO_IR), - /* IEEE 802.11a, channels 52..64 */ - REG_RULE(5260-10, 5320+10, 40, 6, 20, - NL80211_RRF_NO_IR| - NL80211_RRF_DFS), - } -}; - -Then in some part of your code after your wiphy has been registered: - - struct ieee80211_regdomain *rd; - int size_of_regd; - int num_rules = mydriver_jp_regdom.n_reg_rules; - unsigned int i; - - size_of_regd = sizeof(struct ieee80211_regdomain) + - (num_rules * sizeof(struct ieee80211_reg_rule)); - - rd = kzalloc(size_of_regd, GFP_KERNEL); - if (!rd) - return -ENOMEM; - - memcpy(rd, &mydriver_jp_regdom, sizeof(struct ieee80211_regdomain)); - - for (i=0; i < num_rules; i++) - memcpy(&rd->reg_rules[i], - &mydriver_jp_regdom.reg_rules[i], - sizeof(struct ieee80211_reg_rule)); - regulatory_struct_hint(rd); - -Statically compiled regulatory database ---------------------------------------- - -When a database should be fixed into the kernel, it can be provided as a -firmware file at build time that is then linked into the kernel. diff --git a/MAINTAINERS b/MAINTAINERS index ea5dd3d1df9d..b28823ab48c5 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -193,7 +193,7 @@ W: https://wireless.wiki.kernel.org/ T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git F: Documentation/driver-api/80211/cfg80211.rst -F: Documentation/networking/regulatory.txt +F: Documentation/networking/regulatory.rst F: include/linux/ieee80211.h F: include/net/cfg80211.h F: include/net/ieee80211_radiotap.h -- cgit v1.2.3 From 9f72374cb5959556870be8078b128158edde5d3e Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:21 +0200 Subject: docs: networking: convert rxrpc.txt to ReST - add SPDX header; - adjust title markup; - use autonumbered list markups; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/filesystems/afs.rst | 2 +- Documentation/networking/index.rst | 1 + Documentation/networking/rxrpc.rst | 1169 ++++++++++++++++++++++++++++++++++++ Documentation/networking/rxrpc.txt | 1155 ----------------------------------- MAINTAINERS | 2 +- net/rxrpc/Kconfig | 6 +- net/rxrpc/sysctl.c | 2 +- 7 files changed, 1176 insertions(+), 1161 deletions(-) create mode 100644 Documentation/networking/rxrpc.rst delete mode 100644 Documentation/networking/rxrpc.txt (limited to 'Documentation') diff --git a/Documentation/filesystems/afs.rst b/Documentation/filesystems/afs.rst index c4ec39a5966e..cada9464d6bd 100644 --- a/Documentation/filesystems/afs.rst +++ b/Documentation/filesystems/afs.rst @@ -70,7 +70,7 @@ list of volume location server IP addresses:: The first module is the AF_RXRPC network protocol driver. This provides the RxRPC remote operation protocol and may also be accessed from userspace. See: - Documentation/networking/rxrpc.txt + Documentation/networking/rxrpc.rst The second module is the kerberos RxRPC security driver, and the third module is the actual filesystem driver for the AFS filesystem. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index bc3b04a2edde..cd307b9601fa 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -99,6 +99,7 @@ Contents: ray_cs rds regulatory + rxrpc .. only:: subproject and html diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst new file mode 100644 index 000000000000..5ad35113d0f4 --- /dev/null +++ b/Documentation/networking/rxrpc.rst @@ -0,0 +1,1169 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +RxRPC Network Protocol +====================== + +The RxRPC protocol driver provides a reliable two-phase transport on top of UDP +that can be used to perform RxRPC remote operations. This is done over sockets +of AF_RXRPC family, using sendmsg() and recvmsg() with control data to send and +receive data, aborts and errors. + +Contents of this document: + + (#) Overview. + + (#) RxRPC protocol summary. + + (#) AF_RXRPC driver model. + + (#) Control messages. + + (#) Socket options. + + (#) Security. + + (#) Example client usage. + + (#) Example server usage. + + (#) AF_RXRPC kernel interface. + + (#) Configurable parameters. + + +Overview +======== + +RxRPC is a two-layer protocol. There is a session layer which provides +reliable virtual connections using UDP over IPv4 (or IPv6) as the transport +layer, but implements a real network protocol; and there's the presentation +layer which renders structured data to binary blobs and back again using XDR +(as does SunRPC):: + + +-------------+ + | Application | + +-------------+ + | XDR | Presentation + +-------------+ + | RxRPC | Session + +-------------+ + | UDP | Transport + +-------------+ + + +AF_RXRPC provides: + + (1) Part of an RxRPC facility for both kernel and userspace applications by + making the session part of it a Linux network protocol (AF_RXRPC). + + (2) A two-phase protocol. The client transmits a blob (the request) and then + receives a blob (the reply), and the server receives the request and then + transmits the reply. + + (3) Retention of the reusable bits of the transport system set up for one call + to speed up subsequent calls. + + (4) A secure protocol, using the Linux kernel's key retention facility to + manage security on the client end. The server end must of necessity be + more active in security negotiations. + +AF_RXRPC does not provide XDR marshalling/presentation facilities. That is +left to the application. AF_RXRPC only deals in blobs. Even the operation ID +is just the first four bytes of the request blob, and as such is beyond the +kernel's interest. + + +Sockets of AF_RXRPC family are: + + (1) created as type SOCK_DGRAM; + + (2) provided with a protocol of the type of underlying transport they're going + to use - currently only PF_INET is supported. + + +The Andrew File System (AFS) is an example of an application that uses this and +that has both kernel (filesystem) and userspace (utility) components. + + +RxRPC Protocol Summary +====================== + +An overview of the RxRPC protocol: + + (#) RxRPC sits on top of another networking protocol (UDP is the only option + currently), and uses this to provide network transport. UDP ports, for + example, provide transport endpoints. + + (#) RxRPC supports multiple virtual "connections" from any given transport + endpoint, thus allowing the endpoints to be shared, even to the same + remote endpoint. + + (#) Each connection goes to a particular "service". A connection may not go + to multiple services. A service may be considered the RxRPC equivalent of + a port number. AF_RXRPC permits multiple services to share an endpoint. + + (#) Client-originating packets are marked, thus a transport endpoint can be + shared between client and server connections (connections have a + direction). + + (#) Up to a billion connections may be supported concurrently between one + local transport endpoint and one service on one remote endpoint. An RxRPC + connection is described by seven numbers:: + + Local address } + Local port } Transport (UDP) address + Remote address } + Remote port } + Direction + Connection ID + Service ID + + (#) Each RxRPC operation is a "call". A connection may make up to four + billion calls, but only up to four calls may be in progress on a + connection at any one time. + + (#) Calls are two-phase and asymmetric: the client sends its request data, + which the service receives; then the service transmits the reply data + which the client receives. + + (#) The data blobs are of indefinite size, the end of a phase is marked with a + flag in the packet. The number of packets of data making up one blob may + not exceed 4 billion, however, as this would cause the sequence number to + wrap. + + (#) The first four bytes of the request data are the service operation ID. + + (#) Security is negotiated on a per-connection basis. The connection is + initiated by the first data packet on it arriving. If security is + requested, the server then issues a "challenge" and then the client + replies with a "response". If the response is successful, the security is + set for the lifetime of that connection, and all subsequent calls made + upon it use that same security. In the event that the server lets a + connection lapse before the client, the security will be renegotiated if + the client uses the connection again. + + (#) Calls use ACK packets to handle reliability. Data packets are also + explicitly sequenced per call. + + (#) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs. + A hard-ACK indicates to the far side that all the data received to a point + has been received and processed; a soft-ACK indicates that the data has + been received but may yet be discarded and re-requested. The sender may + not discard any transmittable packets until they've been hard-ACK'd. + + (#) Reception of a reply data packet implicitly hard-ACK's all the data + packets that make up the request. + + (#) An call is complete when the request has been sent, the reply has been + received and the final hard-ACK on the last packet of the reply has + reached the server. + + (#) An call may be aborted by either end at any time up to its completion. + + +AF_RXRPC Driver Model +===================== + +About the AF_RXRPC driver: + + (#) The AF_RXRPC protocol transparently uses internal sockets of the transport + protocol to represent transport endpoints. + + (#) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC + connections are handled transparently. One client socket may be used to + make multiple simultaneous calls to the same service. One server socket + may handle calls from many clients. + + (#) Additional parallel client connections will be initiated to support extra + concurrent calls, up to a tunable limit. + + (#) Each connection is retained for a certain amount of time [tunable] after + the last call currently using it has completed in case a new call is made + that could reuse it. + + (#) Each internal UDP socket is retained [tunable] for a certain amount of + time [tunable] after the last connection using it discarded, in case a new + connection is made that could use it. + + (#) A client-side connection is only shared between calls if they have have + the same key struct describing their security (and assuming the calls + would otherwise share the connection). Non-secured calls would also be + able to share connections with each other. + + (#) A server-side connection is shared if the client says it is. + + (#) ACK'ing is handled by the protocol driver automatically, including ping + replying. + + (#) SO_KEEPALIVE automatically pings the other side to keep the connection + alive [TODO]. + + (#) If an ICMP error is received, all calls affected by that error will be + aborted with an appropriate network error passed through recvmsg(). + + +Interaction with the user of the RxRPC socket: + + (#) A socket is made into a server socket by binding an address with a + non-zero service ID. + + (#) In the client, sending a request is achieved with one or more sendmsgs, + followed by the reply being received with one or more recvmsgs. + + (#) The first sendmsg for a request to be sent from a client contains a tag to + be used in all other sendmsgs or recvmsgs associated with that call. The + tag is carried in the control data. + + (#) connect() is used to supply a default destination address for a client + socket. This may be overridden by supplying an alternate address to the + first sendmsg() of a call (struct msghdr::msg_name). + + (#) If connect() is called on an unbound client, a random local port will + bound before the operation takes place. + + (#) A server socket may also be used to make client calls. To do this, the + first sendmsg() of the call must specify the target address. The server's + transport endpoint is used to send the packets. + + (#) Once the application has received the last message associated with a call, + the tag is guaranteed not to be seen again, and so it can be used to pin + client resources. A new call can then be initiated with the same tag + without fear of interference. + + (#) In the server, a request is received with one or more recvmsgs, then the + the reply is transmitted with one or more sendmsgs, and then the final ACK + is received with a last recvmsg. + + (#) When sending data for a call, sendmsg is given MSG_MORE if there's more + data to come on that call. + + (#) When receiving data for a call, recvmsg flags MSG_MORE if there's more + data to come for that call. + + (#) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg + to indicate the terminal message for that call. + + (#) A call may be aborted by adding an abort control message to the control + data. Issuing an abort terminates the kernel's use of that call's tag. + Any messages waiting in the receive queue for that call will be discarded. + + (#) Aborts, busy notifications and challenge packets are delivered by recvmsg, + and control data messages will be set to indicate the context. Receiving + an abort or a busy message terminates the kernel's use of that call's tag. + + (#) The control data part of the msghdr struct is used for a number of things: + + (#) The tag of the intended or affected call. + + (#) Sending or receiving errors, aborts and busy notifications. + + (#) Notifications of incoming calls. + + (#) Sending debug requests and receiving debug replies [TODO]. + + (#) When the kernel has received and set up an incoming call, it sends a + message to server application to let it know there's a new call awaiting + its acceptance [recvmsg reports a special control message]. The server + application then uses sendmsg to assign a tag to the new call. Once that + is done, the first part of the request data will be delivered by recvmsg. + + (#) The server application has to provide the server socket with a keyring of + secret keys corresponding to the security types it permits. When a secure + connection is being set up, the kernel looks up the appropriate secret key + in the keyring and then sends a challenge packet to the client and + receives a response packet. The kernel then checks the authorisation of + the packet and either aborts the connection or sets up the security. + + (#) The name of the key a client will use to secure its communications is + nominated by a socket option. + + +Notes on sendmsg: + + (#) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is + making progress at accepting packets within a reasonable time such that we + manage to queue up all the data for transmission. This requires the + client to accept at least one packet per 2*RTT time period. + + If this isn't set, sendmsg() will return immediately, either returning + EINTR/ERESTARTSYS if nothing was consumed or returning the amount of data + consumed. + + +Notes on recvmsg: + + (#) If there's a sequence of data messages belonging to a particular call on + the receive queue, then recvmsg will keep working through them until: + + (a) it meets the end of that call's received data, + + (b) it meets a non-data message, + + (c) it meets a message belonging to a different call, or + + (d) it fills the user buffer. + + If recvmsg is called in blocking mode, it will keep sleeping, awaiting the + reception of further data, until one of the above four conditions is met. + + (2) MSG_PEEK operates similarly, but will return immediately if it has put any + data in the buffer rather than sleeping until it can fill the buffer. + + (3) If a data message is only partially consumed in filling a user buffer, + then the remainder of that message will be left on the front of the queue + for the next taker. MSG_TRUNC will never be flagged. + + (4) If there is more data to be had on a call (it hasn't copied the last byte + of the last data message in that phase yet), then MSG_MORE will be + flagged. + + +Control Messages +================ + +AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex +calls, to invoke certain actions and to report certain conditions. These are: + + ======================= === =========== =============================== + MESSAGE ID SRT DATA MEANING + ======================= === =========== =============================== + RXRPC_USER_CALL_ID sr- User ID App's call specifier + RXRPC_ABORT srt Abort code Abort code to issue/received + RXRPC_ACK -rt n/a Final ACK received + RXRPC_NET_ERROR -rt error num Network error on call + RXRPC_BUSY -rt n/a Call rejected (server busy) + RXRPC_LOCAL_ERROR -rt error num Local error encountered + RXRPC_NEW_CALL -r- n/a New call received + RXRPC_ACCEPT s-- n/a Accept new call + RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call + RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded + RXRPC_TX_LENGTH s-- data len Total length of Tx data + ======================= === =========== =============================== + + (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message) + + (#) RXRPC_USER_CALL_ID + + This is used to indicate the application's call ID. It's an unsigned long + that the app specifies in the client by attaching it to the first data + message or in the server by passing it in association with an RXRPC_ACCEPT + message. recvmsg() passes it in conjunction with all messages except + those of the RXRPC_NEW_CALL message. + + (#) RXRPC_ABORT + + This is can be used by an application to abort a call by passing it to + sendmsg, or it can be delivered by recvmsg to indicate a remote abort was + received. Either way, it must be associated with an RXRPC_USER_CALL_ID to + specify the call affected. If an abort is being sent, then error EBADSLT + will be returned if there is no call with that user ID. + + (#) RXRPC_ACK + + This is delivered to a server application to indicate that the final ACK + of a call was received from the client. It will be associated with an + RXRPC_USER_CALL_ID to indicate the call that's now complete. + + (#) RXRPC_NET_ERROR + + This is delivered to an application to indicate that an ICMP error message + was encountered in the process of trying to talk to the peer. An + errno-class integer value will be included in the control message data + indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call + affected. + + (#) RXRPC_BUSY + + This is delivered to a client application to indicate that a call was + rejected by the server due to the server being busy. It will be + associated with an RXRPC_USER_CALL_ID to indicate the rejected call. + + (#) RXRPC_LOCAL_ERROR + + This is delivered to an application to indicate that a local error was + encountered and that a call has been aborted because of it. An + errno-class integer value will be included in the control message data + indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call + affected. + + (#) RXRPC_NEW_CALL + + This is delivered to indicate to a server application that a new call has + arrived and is awaiting acceptance. No user ID is associated with this, + as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT. + + (#) RXRPC_ACCEPT + + This is used by a server application to attempt to accept a call and + assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID + to indicate the user ID to be assigned. If there is no call to be + accepted (it may have timed out, been aborted, etc.), then sendmsg will + return error ENODATA. If the user ID is already in use by another call, + then error EBADSLT will be returned. + + (#) RXRPC_EXCLUSIVE_CALL + + This is used to indicate that a client call should be made on a one-off + connection. The connection is discarded once the call has terminated. + + (#) RXRPC_UPGRADE_SERVICE + + This is used to make a client call to probe if the specified service ID + may be upgraded by the server. The caller must check msg_name returned to + recvmsg() for the service ID actually in use. The operation probed must + be one that takes the same arguments in both services. + + Once this has been used to establish the upgrade capability (or lack + thereof) of the server, the service ID returned should be used for all + future communication to that server and RXRPC_UPGRADE_SERVICE should no + longer be set. + + (#) RXRPC_TX_LENGTH + + This is used to inform the kernel of the total amount of data that is + going to be transmitted by a call (whether in a client request or a + service response). If given, it allows the kernel to encrypt from the + userspace buffer directly to the packet buffers, rather than copying into + the buffer and then encrypting in place. This may only be given with the + first sendmsg() providing data for a call. EMSGSIZE will be generated if + the amount of data actually given is different. + + This takes a parameter of __s64 type that indicates how much will be + transmitted. This may not be less than zero. + +The symbol RXRPC__SUPPORTED is defined as one more than the highest control +message type supported. At run time this can be queried by means of the +RXRPC_SUPPORTED_CMSG socket option (see below). + + +============== +SOCKET OPTIONS +============== + +AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: + + (#) RXRPC_SECURITY_KEY + + This is used to specify the description of the key to be used. The key is + extracted from the calling process's keyrings with request_key() and + should be of "rxrpc" type. + + The optval pointer points to the description string, and optlen indicates + how long the string is, without the NUL terminator. + + (#) RXRPC_SECURITY_KEYRING + + Similar to above but specifies a keyring of server secret keys to use (key + type "keyring"). See the "Security" section. + + (#) RXRPC_EXCLUSIVE_CONNECTION + + This is used to request that new connections should be used for each call + made subsequently on this socket. optval should be NULL and optlen 0. + + (#) RXRPC_MIN_SECURITY_LEVEL + + This is used to specify the minimum security level required for calls on + this socket. optval must point to an int containing one of the following + values: + + (a) RXRPC_SECURITY_PLAIN + + Encrypted checksum only. + + (b) RXRPC_SECURITY_AUTH + + Encrypted checksum plus packet padded and first eight bytes of packet + encrypted - which includes the actual packet length. + + (c) RXRPC_SECURITY_ENCRYPTED + + Encrypted checksum plus entire packet padded and encrypted, including + actual packet length. + + (#) RXRPC_UPGRADEABLE_SERVICE + + This is used to indicate that a service socket with two bindings may + upgrade one bound service to the other if requested by the client. optval + must point to an array of two unsigned short ints. The first is the + service ID to upgrade from and the second the service ID to upgrade to. + + (#) RXRPC_SUPPORTED_CMSG + + This is a read-only option that writes an int into the buffer indicating + the highest control message type supported. + + +======== +SECURITY +======== + +Currently, only the kerberos 4 equivalent protocol has been implemented +(security index 2 - rxkad). This requires the rxkad module to be loaded and, +on the client, tickets of the appropriate type to be obtained from the AFS +kaserver or the kerberos server and installed as "rxrpc" type keys. This is +normally done using the klog program. An example simple klog program can be +found at: + + http://people.redhat.com/~dhowells/rxrpc/klog.c + +The payload provided to add_key() on the client should be of the following +form:: + + struct rxrpc_key_sec2_v1 { + uint16_t security_index; /* 2 */ + uint16_t ticket_length; /* length of ticket[] */ + uint32_t expiry; /* time at which expires */ + uint8_t kvno; /* key version number */ + uint8_t __pad[3]; + uint8_t session_key[8]; /* DES session key */ + uint8_t ticket[0]; /* the encrypted ticket */ + }; + +Where the ticket blob is just appended to the above structure. + + +For the server, keys of type "rxrpc_s" must be made available to the server. +They have a description of ":" (eg: "52:2" for an +rxkad key for the AFS VL service). When such a key is created, it should be +given the server's secret key as the instantiation data (see the example +below). + + add_key("rxrpc_s", "52:2", secret_key, 8, keyring); + +A keyring is passed to the server socket by naming it in a sockopt. The server +socket then looks the server secret keys up in this keyring when secure +incoming connections are made. This can be seen in an example program that can +be found at: + + http://people.redhat.com/~dhowells/rxrpc/listen.c + + +==================== +EXAMPLE CLIENT USAGE +==================== + +A client would issue an operation by: + + (1) An RxRPC socket is set up by:: + + client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); + + Where the third parameter indicates the protocol family of the transport + socket used - usually IPv4 but it can also be IPv6 [TODO]. + + (2) A local address can optionally be bound:: + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = 0, /* we're a client */ + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7000), /* AFS callback */ + .transport.sin_address = 0, /* all local interfaces */ + }; + bind(client, &srx, sizeof(srx)); + + This specifies the local UDP port to be used. If not given, a random + non-privileged port will be used. A UDP port may be shared between + several unrelated RxRPC sockets. Security is handled on a basis of + per-RxRPC virtual connection. + + (3) The security is set:: + + const char *key = "AFS:cambridge.redhat.com"; + setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key)); + + This issues a request_key() to get the key representing the security + context. The minimum security level can be set:: + + unsigned int sec = RXRPC_SECURITY_ENCRYPTED; + setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL, + &sec, sizeof(sec)); + + (4) The server to be contacted can then be specified (alternatively this can + be done through sendmsg):: + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = VL_SERVICE_ID, + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7005), /* AFS volume manager */ + .transport.sin_address = ..., + }; + connect(client, &srx, sizeof(srx)); + + (5) The request data should then be posted to the server socket using a series + of sendmsg() calls, each with the following control message attached: + + ================== =================================== + RXRPC_USER_CALL_ID specifies the user ID for this call + ================== =================================== + + MSG_MORE should be set in msghdr::msg_flags on all but the last part of + the request. Multiple requests may be made simultaneously. + + An RXRPC_TX_LENGTH control message can also be specified on the first + sendmsg() call. + + If a call is intended to go to a destination other than the default + specified through connect(), then msghdr::msg_name should be set on the + first request message of that call. + + (6) The reply data will then be posted to the server socket for recvmsg() to + pick up. MSG_MORE will be flagged by recvmsg() if there's more reply data + for a particular call to be read. MSG_EOR will be set on the terminal + read for a call. + + All data will be delivered with the following control message attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + + If an abort or error occurred, this will be returned in the control data + buffer instead, and MSG_EOR will be flagged to indicate the end of that + call. + +A client may ask for a service ID it knows and ask that this be upgraded to a +better service if one is available by supplying RXRPC_UPGRADE_SERVICE on the +first sendmsg() of a call. The client should then check srx_service in the +msg_name filled in by recvmsg() when collecting the result. srx_service will +hold the same value as given to sendmsg() if the upgrade request was ignored by +the service - otherwise it will be altered to indicate the service ID the +server upgraded to. Note that the upgraded service ID is chosen by the server. +The caller has to wait until it sees the service ID in the reply before sending +any more calls (further calls to the same destination will be blocked until the +probe is concluded). + + +Example Server Usage +==================== + +A server would be set up to accept operations in the following manner: + + (1) An RxRPC socket is created by:: + + server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); + + Where the third parameter indicates the address type of the transport + socket used - usually IPv4. + + (2) Security is set up if desired by giving the socket a keyring with server + secret keys in it:: + + keyring = add_key("keyring", "AFSkeys", NULL, 0, + KEY_SPEC_PROCESS_KEYRING); + + const char secret_key[8] = { + 0xa7, 0x83, 0x8a, 0xcb, 0xc7, 0x83, 0xec, 0x94 }; + add_key("rxrpc_s", "52:2", secret_key, 8, keyring); + + setsockopt(server, SOL_RXRPC, RXRPC_SECURITY_KEYRING, "AFSkeys", 7); + + The keyring can be manipulated after it has been given to the socket. This + permits the server to add more keys, replace keys, etc. while it is live. + + (3) A local address must then be bound:: + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = VL_SERVICE_ID, /* RxRPC service ID */ + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7000), /* AFS callback */ + .transport.sin_address = 0, /* all local interfaces */ + }; + bind(server, &srx, sizeof(srx)); + + More than one service ID may be bound to a socket, provided the transport + parameters are the same. The limit is currently two. To do this, bind() + should be called twice. + + (4) If service upgrading is required, first two service IDs must have been + bound and then the following option must be set:: + + unsigned short service_ids[2] = { from_ID, to_ID }; + setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE, + service_ids, sizeof(service_ids)); + + This will automatically upgrade connections on service from_ID to service + to_ID if they request it. This will be reflected in msg_name obtained + through recvmsg() when the request data is delivered to userspace. + + (5) The server is then set to listen out for incoming calls:: + + listen(server, 100); + + (6) The kernel notifies the server of pending incoming connections by sending + it a message for each. This is received with recvmsg() on the server + socket. It has no data, and has a single dataless control message + attached:: + + RXRPC_NEW_CALL + + The address that can be passed back by recvmsg() at this point should be + ignored since the call for which the message was posted may have gone by + the time it is accepted - in which case the first call still on the queue + will be accepted. + + (7) The server then accepts the new call by issuing a sendmsg() with two + pieces of control data and no actual data: + + ================== ============================== + RXRPC_ACCEPT indicate connection acceptance + RXRPC_USER_CALL_ID specify user ID for this call + ================== ============================== + + (8) The first request data packet will then be posted to the server socket for + recvmsg() to pick up. At that point, the RxRPC address for the call can + be read from the address fields in the msghdr struct. + + Subsequent request data will be posted to the server socket for recvmsg() + to collect as it arrives. All but the last piece of the request data will + be delivered with MSG_MORE flagged. + + All data will be delivered with the following control message attached: + + + ================== =================================== + RXRPC_USER_CALL_ID specifies the user ID for this call + ================== =================================== + + (9) The reply data should then be posted to the server socket using a series + of sendmsg() calls, each with the following control messages attached: + + ================== =================================== + RXRPC_USER_CALL_ID specifies the user ID for this call + ================== =================================== + + MSG_MORE should be set in msghdr::msg_flags on all but the last message + for a particular call. + +(10) The final ACK from the client will be posted for retrieval by recvmsg() + when it is received. It will take the form of a dataless message with two + control messages attached: + + ================== =================================== + RXRPC_USER_CALL_ID specifies the user ID for this call + RXRPC_ACK indicates final ACK (no data) + ================== =================================== + + MSG_EOR will be flagged to indicate that this is the final message for + this call. + +(11) Up to the point the final packet of reply data is sent, the call can be + aborted by calling sendmsg() with a dataless message with the following + control messages attached: + + ================== =================================== + RXRPC_USER_CALL_ID specifies the user ID for this call + RXRPC_ABORT indicates abort code (4 byte data) + ================== =================================== + + Any packets waiting in the socket's receive queue will be discarded if + this is issued. + +Note that all the communications for a particular service take place through +the one server socket, using control messages on sendmsg() and recvmsg() to +determine the call affected. + + +AF_RXRPC Kernel Interface +========================= + +The AF_RXRPC module also provides an interface for use by in-kernel utilities +such as the AFS filesystem. This permits such a utility to: + + (1) Use different keys directly on individual client calls on one socket + rather than having to open a whole slew of sockets, one for each key it + might want to use. + + (2) Avoid having RxRPC call request_key() at the point of issue of a call or + opening of a socket. Instead the utility is responsible for requesting a + key at the appropriate point. AFS, for instance, would do this during VFS + operations such as open() or unlink(). The key is then handed through + when the call is initiated. + + (3) Request the use of something other than GFP_KERNEL to allocate memory. + + (4) Avoid the overhead of using the recvmsg() call. RxRPC messages can be + intercepted before they get put into the socket Rx queue and the socket + buffers manipulated directly. + +To use the RxRPC facility, a kernel utility must still open an AF_RXRPC socket, +bind an address as appropriate and listen if it's to be a server socket, but +then it passes this to the kernel interface functions. + +The kernel interface functions are as follows: + + (#) Begin a new client call:: + + struct rxrpc_call * + rxrpc_kernel_begin_call(struct socket *sock, + struct sockaddr_rxrpc *srx, + struct key *key, + unsigned long user_call_ID, + s64 tx_total_len, + gfp_t gfp, + rxrpc_notify_rx_t notify_rx, + bool upgrade, + bool intr, + unsigned int debug_id); + + This allocates the infrastructure to make a new RxRPC call and assigns + call and connection numbers. The call will be made on the UDP port that + the socket is bound to. The call will go to the destination address of a + connected client socket unless an alternative is supplied (srx is + non-NULL). + + If a key is supplied then this will be used to secure the call instead of + the key bound to the socket with the RXRPC_SECURITY_KEY sockopt. Calls + secured in this way will still share connections if at all possible. + + The user_call_ID is equivalent to that supplied to sendmsg() in the + control data buffer. It is entirely feasible to use this to point to a + kernel data structure. + + tx_total_len is the amount of data the caller is intending to transmit + with this call (or -1 if unknown at this point). Setting the data size + allows the kernel to encrypt directly to the packet buffers, thereby + saving a copy. The value may not be less than -1. + + notify_rx is a pointer to a function to be called when events such as + incoming data packets or remote aborts happen. + + upgrade should be set to true if a client operation should request that + the server upgrade the service to a better one. The resultant service ID + is returned by rxrpc_kernel_recv_data(). + + intr should be set to true if the call should be interruptible. If this + is not set, this function may not return until a channel has been + allocated; if it is set, the function may return -ERESTARTSYS. + + debug_id is the call debugging ID to be used for tracing. This can be + obtained by atomically incrementing rxrpc_debug_id. + + If this function is successful, an opaque reference to the RxRPC call is + returned. The caller now holds a reference on this and it must be + properly ended. + + (#) End a client call:: + + void rxrpc_kernel_end_call(struct socket *sock, + struct rxrpc_call *call); + + This is used to end a previously begun call. The user_call_ID is expunged + from AF_RXRPC's knowledge and will not be seen again in association with + the specified call. + + (#) Send data through a call:: + + typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk, + unsigned long user_call_ID, + struct sk_buff *skb); + + int rxrpc_kernel_send_data(struct socket *sock, + struct rxrpc_call *call, + struct msghdr *msg, + size_t len, + rxrpc_notify_end_tx_t notify_end_rx); + + This is used to supply either the request part of a client call or the + reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the + data buffers to be used. msg_iov may not be NULL and must point + exclusively to in-kernel virtual addresses. msg.msg_flags may be given + MSG_MORE if there will be subsequent data sends for this call. + + The msg must not specify a destination address, control data or any flags + other than MSG_MORE. len is the total amount of data to transmit. + + notify_end_rx can be NULL or it can be used to specify a function to be + called when the call changes state to end the Tx phase. This function is + called with the call-state spinlock held to prevent any reply or final ACK + from being delivered first. + + (#) Receive data from a call:: + + int rxrpc_kernel_recv_data(struct socket *sock, + struct rxrpc_call *call, + void *buf, + size_t size, + size_t *_offset, + bool want_more, + u32 *_abort, + u16 *_service) + + This is used to receive data from either the reply part of a client call + or the request part of a service call. buf and size specify how much + data is desired and where to store it. *_offset is added on to buf and + subtracted from size internally; the amount copied into the buffer is + added to *_offset before returning. + + want_more should be true if further data will be required after this is + satisfied and false if this is the last item of the receive phase. + + There are three normal returns: 0 if the buffer was filled and want_more + was true; 1 if the buffer was filled, the last DATA packet has been + emptied and want_more was false; and -EAGAIN if the function needs to be + called again. + + If the last DATA packet is processed but the buffer contains less than + the amount requested, EBADMSG is returned. If want_more wasn't set, but + more data was available, EMSGSIZE is returned. + + If a remote ABORT is detected, the abort code received will be stored in + ``*_abort`` and ECONNABORTED will be returned. + + The service ID that the call ended up with is returned into *_service. + This can be used to see if a call got a service upgrade. + + (#) Abort a call?? + + :: + + void rxrpc_kernel_abort_call(struct socket *sock, + struct rxrpc_call *call, + u32 abort_code); + + This is used to abort a call if it's still in an abortable state. The + abort code specified will be placed in the ABORT message sent. + + (#) Intercept received RxRPC messages:: + + typedef void (*rxrpc_interceptor_t)(struct sock *sk, + unsigned long user_call_ID, + struct sk_buff *skb); + + void + rxrpc_kernel_intercept_rx_messages(struct socket *sock, + rxrpc_interceptor_t interceptor); + + This installs an interceptor function on the specified AF_RXRPC socket. + All messages that would otherwise wind up in the socket's Rx queue are + then diverted to this function. Note that care must be taken to process + the messages in the right order to maintain DATA message sequentiality. + + The interceptor function itself is provided with the address of the socket + and handling the incoming message, the ID assigned by the kernel utility + to the call and the socket buffer containing the message. + + The skb->mark field indicates the type of message: + + =============================== ======================================= + Mark Meaning + =============================== ======================================= + RXRPC_SKB_MARK_DATA Data message + RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call + RXRPC_SKB_MARK_BUSY Client call rejected as server busy + RXRPC_SKB_MARK_REMOTE_ABORT Call aborted by peer + RXRPC_SKB_MARK_NET_ERROR Network error detected + RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered + RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance + =============================== ======================================= + + The remote abort message can be probed with rxrpc_kernel_get_abort_code(). + The two error messages can be probed with rxrpc_kernel_get_error_number(). + A new call can be accepted with rxrpc_kernel_accept_call(). + + Data messages can have their contents extracted with the usual bunch of + socket buffer manipulation functions. A data message can be determined to + be the last one in a sequence with rxrpc_kernel_is_data_last(). When a + data message has been used up, rxrpc_kernel_data_consumed() should be + called on it. + + Messages should be handled to rxrpc_kernel_free_skb() to dispose of. It + is possible to get extra refs on all types of message for later freeing, + but this may pin the state of a call until the message is finally freed. + + (#) Accept an incoming call:: + + struct rxrpc_call * + rxrpc_kernel_accept_call(struct socket *sock, + unsigned long user_call_ID); + + This is used to accept an incoming call and to assign it a call ID. This + function is similar to rxrpc_kernel_begin_call() and calls accepted must + be ended in the same way. + + If this function is successful, an opaque reference to the RxRPC call is + returned. The caller now holds a reference on this and it must be + properly ended. + + (#) Reject an incoming call:: + + int rxrpc_kernel_reject_call(struct socket *sock); + + This is used to reject the first incoming call on the socket's queue with + a BUSY message. -ENODATA is returned if there were no incoming calls. + Other errors may be returned if the call had been aborted (-ECONNABORTED) + or had timed out (-ETIME). + + (#) Allocate a null key for doing anonymous security:: + + struct key *rxrpc_get_null_key(const char *keyname); + + This is used to allocate a null RxRPC key that can be used to indicate + anonymous security for a particular domain. + + (#) Get the peer address of a call:: + + void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call, + struct sockaddr_rxrpc *_srx); + + This is used to find the remote peer address of a call. + + (#) Set the total transmit data size on a call:: + + void rxrpc_kernel_set_tx_length(struct socket *sock, + struct rxrpc_call *call, + s64 tx_total_len); + + This sets the amount of data that the caller is intending to transmit on a + call. It's intended to be used for setting the reply size as the request + size should be set when the call is begun. tx_total_len may not be less + than zero. + + (#) Get call RTT:: + + u64 rxrpc_kernel_get_rtt(struct socket *sock, struct rxrpc_call *call); + + Get the RTT time to the peer in use by a call. The value returned is in + nanoseconds. + + (#) Check call still alive:: + + bool rxrpc_kernel_check_life(struct socket *sock, + struct rxrpc_call *call, + u32 *_life); + void rxrpc_kernel_probe_life(struct socket *sock, + struct rxrpc_call *call); + + The first function passes back in ``*_life`` a number that is updated when + ACKs are received from the peer (notably including PING RESPONSE ACKs + which we can elicit by sending PING ACKs to see if the call still exists + on the server). The caller should compare the numbers of two calls to see + if the call is still alive after waiting for a suitable interval. It also + returns true as long as the call hasn't yet reached the completed state. + + This allows the caller to work out if the server is still contactable and + if the call is still alive on the server while waiting for the server to + process a client operation. + + The second function causes a ping ACK to be transmitted to try to provoke + the peer into responding, which would then cause the value returned by the + first function to change. Note that this must be called in TASK_RUNNING + state. + + (#) Get reply timestamp:: + + bool rxrpc_kernel_get_reply_time(struct socket *sock, + struct rxrpc_call *call, + ktime_t *_ts) + + This allows the timestamp on the first DATA packet of the reply of a + client call to be queried, provided that it is still in the Rx ring. If + successful, the timestamp will be stored into ``*_ts`` and true will be + returned; false will be returned otherwise. + + (#) Get remote client epoch:: + + u32 rxrpc_kernel_get_epoch(struct socket *sock, + struct rxrpc_call *call) + + This allows the epoch that's contained in packets of an incoming client + call to be queried. This value is returned. The function always + successful if the call is still in progress. It shouldn't be called once + the call has expired. Note that calling this on a local client call only + returns the local epoch. + + This value can be used to determine if the remote client has been + restarted as it shouldn't change otherwise. + + (#) Set the maxmimum lifespan on a call:: + + void rxrpc_kernel_set_max_life(struct socket *sock, + struct rxrpc_call *call, + unsigned long hard_timeout) + + This sets the maximum lifespan on a call to hard_timeout (which is in + jiffies). In the event of the timeout occurring, the call will be + aborted and -ETIME or -ETIMEDOUT will be returned. + + +Configurable Parameters +======================= + +The RxRPC protocol driver has a number of configurable parameters that can be +adjusted through sysctls in /proc/net/rxrpc/: + + (#) req_ack_delay + + The amount of time in milliseconds after receiving a packet with the + request-ack flag set before we honour the flag and actually send the + requested ack. + + Usually the other side won't stop sending packets until the advertised + reception window is full (to a maximum of 255 packets), so delaying the + ACK permits several packets to be ACK'd in one go. + + (#) soft_ack_delay + + The amount of time in milliseconds after receiving a new packet before we + generate a soft-ACK to tell the sender that it doesn't need to resend. + + (#) idle_ack_delay + + The amount of time in milliseconds after all the packets currently in the + received queue have been consumed before we generate a hard-ACK to tell + the sender it can free its buffers, assuming no other reason occurs that + we would send an ACK. + + (#) resend_timeout + + The amount of time in milliseconds after transmitting a packet before we + transmit it again, assuming no ACK is received from the receiver telling + us they got it. + + (#) max_call_lifetime + + The maximum amount of time in seconds that a call may be in progress + before we preemptively kill it. + + (#) dead_call_expiry + + The amount of time in seconds before we remove a dead call from the call + list. Dead calls are kept around for a little while for the purpose of + repeating ACK and ABORT packets. + + (#) connection_expiry + + The amount of time in seconds after a connection was last used before we + remove it from the connection list. While a connection is in existence, + it serves as a placeholder for negotiated security; when it is deleted, + the security must be renegotiated. + + (#) transport_expiry + + The amount of time in seconds after a transport was last used before we + remove it from the transport list. While a transport is in existence, it + serves to anchor the peer data and keeps the connection ID counter. + + (#) rxrpc_rx_window_size + + The size of the receive window in packets. This is the maximum number of + unconsumed received packets we're willing to hold in memory for any + particular call. + + (#) rxrpc_rx_mtu + + The maximum packet MTU size that we're willing to receive in bytes. This + indicates to the peer whether we're willing to accept jumbo packets. + + (#) rxrpc_rx_jumbo_max + + The maximum number of packets that we're willing to accept in a jumbo + packet. Non-terminal packets in a jumbo packet must contain a four byte + header plus exactly 1412 bytes of data. The terminal packet must contain + a four byte header plus any amount of data. In any event, a jumbo packet + may not exceed rxrpc_rx_mtu in size. diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt deleted file mode 100644 index 180e07d956a7..000000000000 --- a/Documentation/networking/rxrpc.txt +++ /dev/null @@ -1,1155 +0,0 @@ - ====================== - RxRPC NETWORK PROTOCOL - ====================== - -The RxRPC protocol driver provides a reliable two-phase transport on top of UDP -that can be used to perform RxRPC remote operations. This is done over sockets -of AF_RXRPC family, using sendmsg() and recvmsg() with control data to send and -receive data, aborts and errors. - -Contents of this document: - - (*) Overview. - - (*) RxRPC protocol summary. - - (*) AF_RXRPC driver model. - - (*) Control messages. - - (*) Socket options. - - (*) Security. - - (*) Example client usage. - - (*) Example server usage. - - (*) AF_RXRPC kernel interface. - - (*) Configurable parameters. - - -======== -OVERVIEW -======== - -RxRPC is a two-layer protocol. There is a session layer which provides -reliable virtual connections using UDP over IPv4 (or IPv6) as the transport -layer, but implements a real network protocol; and there's the presentation -layer which renders structured data to binary blobs and back again using XDR -(as does SunRPC): - - +-------------+ - | Application | - +-------------+ - | XDR | Presentation - +-------------+ - | RxRPC | Session - +-------------+ - | UDP | Transport - +-------------+ - - -AF_RXRPC provides: - - (1) Part of an RxRPC facility for both kernel and userspace applications by - making the session part of it a Linux network protocol (AF_RXRPC). - - (2) A two-phase protocol. The client transmits a blob (the request) and then - receives a blob (the reply), and the server receives the request and then - transmits the reply. - - (3) Retention of the reusable bits of the transport system set up for one call - to speed up subsequent calls. - - (4) A secure protocol, using the Linux kernel's key retention facility to - manage security on the client end. The server end must of necessity be - more active in security negotiations. - -AF_RXRPC does not provide XDR marshalling/presentation facilities. That is -left to the application. AF_RXRPC only deals in blobs. Even the operation ID -is just the first four bytes of the request blob, and as such is beyond the -kernel's interest. - - -Sockets of AF_RXRPC family are: - - (1) created as type SOCK_DGRAM; - - (2) provided with a protocol of the type of underlying transport they're going - to use - currently only PF_INET is supported. - - -The Andrew File System (AFS) is an example of an application that uses this and -that has both kernel (filesystem) and userspace (utility) components. - - -====================== -RXRPC PROTOCOL SUMMARY -====================== - -An overview of the RxRPC protocol: - - (*) RxRPC sits on top of another networking protocol (UDP is the only option - currently), and uses this to provide network transport. UDP ports, for - example, provide transport endpoints. - - (*) RxRPC supports multiple virtual "connections" from any given transport - endpoint, thus allowing the endpoints to be shared, even to the same - remote endpoint. - - (*) Each connection goes to a particular "service". A connection may not go - to multiple services. A service may be considered the RxRPC equivalent of - a port number. AF_RXRPC permits multiple services to share an endpoint. - - (*) Client-originating packets are marked, thus a transport endpoint can be - shared between client and server connections (connections have a - direction). - - (*) Up to a billion connections may be supported concurrently between one - local transport endpoint and one service on one remote endpoint. An RxRPC - connection is described by seven numbers: - - Local address } - Local port } Transport (UDP) address - Remote address } - Remote port } - Direction - Connection ID - Service ID - - (*) Each RxRPC operation is a "call". A connection may make up to four - billion calls, but only up to four calls may be in progress on a - connection at any one time. - - (*) Calls are two-phase and asymmetric: the client sends its request data, - which the service receives; then the service transmits the reply data - which the client receives. - - (*) The data blobs are of indefinite size, the end of a phase is marked with a - flag in the packet. The number of packets of data making up one blob may - not exceed 4 billion, however, as this would cause the sequence number to - wrap. - - (*) The first four bytes of the request data are the service operation ID. - - (*) Security is negotiated on a per-connection basis. The connection is - initiated by the first data packet on it arriving. If security is - requested, the server then issues a "challenge" and then the client - replies with a "response". If the response is successful, the security is - set for the lifetime of that connection, and all subsequent calls made - upon it use that same security. In the event that the server lets a - connection lapse before the client, the security will be renegotiated if - the client uses the connection again. - - (*) Calls use ACK packets to handle reliability. Data packets are also - explicitly sequenced per call. - - (*) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs. - A hard-ACK indicates to the far side that all the data received to a point - has been received and processed; a soft-ACK indicates that the data has - been received but may yet be discarded and re-requested. The sender may - not discard any transmittable packets until they've been hard-ACK'd. - - (*) Reception of a reply data packet implicitly hard-ACK's all the data - packets that make up the request. - - (*) An call is complete when the request has been sent, the reply has been - received and the final hard-ACK on the last packet of the reply has - reached the server. - - (*) An call may be aborted by either end at any time up to its completion. - - -===================== -AF_RXRPC DRIVER MODEL -===================== - -About the AF_RXRPC driver: - - (*) The AF_RXRPC protocol transparently uses internal sockets of the transport - protocol to represent transport endpoints. - - (*) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC - connections are handled transparently. One client socket may be used to - make multiple simultaneous calls to the same service. One server socket - may handle calls from many clients. - - (*) Additional parallel client connections will be initiated to support extra - concurrent calls, up to a tunable limit. - - (*) Each connection is retained for a certain amount of time [tunable] after - the last call currently using it has completed in case a new call is made - that could reuse it. - - (*) Each internal UDP socket is retained [tunable] for a certain amount of - time [tunable] after the last connection using it discarded, in case a new - connection is made that could use it. - - (*) A client-side connection is only shared between calls if they have have - the same key struct describing their security (and assuming the calls - would otherwise share the connection). Non-secured calls would also be - able to share connections with each other. - - (*) A server-side connection is shared if the client says it is. - - (*) ACK'ing is handled by the protocol driver automatically, including ping - replying. - - (*) SO_KEEPALIVE automatically pings the other side to keep the connection - alive [TODO]. - - (*) If an ICMP error is received, all calls affected by that error will be - aborted with an appropriate network error passed through recvmsg(). - - -Interaction with the user of the RxRPC socket: - - (*) A socket is made into a server socket by binding an address with a - non-zero service ID. - - (*) In the client, sending a request is achieved with one or more sendmsgs, - followed by the reply being received with one or more recvmsgs. - - (*) The first sendmsg for a request to be sent from a client contains a tag to - be used in all other sendmsgs or recvmsgs associated with that call. The - tag is carried in the control data. - - (*) connect() is used to supply a default destination address for a client - socket. This may be overridden by supplying an alternate address to the - first sendmsg() of a call (struct msghdr::msg_name). - - (*) If connect() is called on an unbound client, a random local port will - bound before the operation takes place. - - (*) A server socket may also be used to make client calls. To do this, the - first sendmsg() of the call must specify the target address. The server's - transport endpoint is used to send the packets. - - (*) Once the application has received the last message associated with a call, - the tag is guaranteed not to be seen again, and so it can be used to pin - client resources. A new call can then be initiated with the same tag - without fear of interference. - - (*) In the server, a request is received with one or more recvmsgs, then the - the reply is transmitted with one or more sendmsgs, and then the final ACK - is received with a last recvmsg. - - (*) When sending data for a call, sendmsg is given MSG_MORE if there's more - data to come on that call. - - (*) When receiving data for a call, recvmsg flags MSG_MORE if there's more - data to come for that call. - - (*) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg - to indicate the terminal message for that call. - - (*) A call may be aborted by adding an abort control message to the control - data. Issuing an abort terminates the kernel's use of that call's tag. - Any messages waiting in the receive queue for that call will be discarded. - - (*) Aborts, busy notifications and challenge packets are delivered by recvmsg, - and control data messages will be set to indicate the context. Receiving - an abort or a busy message terminates the kernel's use of that call's tag. - - (*) The control data part of the msghdr struct is used for a number of things: - - (*) The tag of the intended or affected call. - - (*) Sending or receiving errors, aborts and busy notifications. - - (*) Notifications of incoming calls. - - (*) Sending debug requests and receiving debug replies [TODO]. - - (*) When the kernel has received and set up an incoming call, it sends a - message to server application to let it know there's a new call awaiting - its acceptance [recvmsg reports a special control message]. The server - application then uses sendmsg to assign a tag to the new call. Once that - is done, the first part of the request data will be delivered by recvmsg. - - (*) The server application has to provide the server socket with a keyring of - secret keys corresponding to the security types it permits. When a secure - connection is being set up, the kernel looks up the appropriate secret key - in the keyring and then sends a challenge packet to the client and - receives a response packet. The kernel then checks the authorisation of - the packet and either aborts the connection or sets up the security. - - (*) The name of the key a client will use to secure its communications is - nominated by a socket option. - - -Notes on sendmsg: - - (*) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is - making progress at accepting packets within a reasonable time such that we - manage to queue up all the data for transmission. This requires the - client to accept at least one packet per 2*RTT time period. - - If this isn't set, sendmsg() will return immediately, either returning - EINTR/ERESTARTSYS if nothing was consumed or returning the amount of data - consumed. - - -Notes on recvmsg: - - (*) If there's a sequence of data messages belonging to a particular call on - the receive queue, then recvmsg will keep working through them until: - - (a) it meets the end of that call's received data, - - (b) it meets a non-data message, - - (c) it meets a message belonging to a different call, or - - (d) it fills the user buffer. - - If recvmsg is called in blocking mode, it will keep sleeping, awaiting the - reception of further data, until one of the above four conditions is met. - - (2) MSG_PEEK operates similarly, but will return immediately if it has put any - data in the buffer rather than sleeping until it can fill the buffer. - - (3) If a data message is only partially consumed in filling a user buffer, - then the remainder of that message will be left on the front of the queue - for the next taker. MSG_TRUNC will never be flagged. - - (4) If there is more data to be had on a call (it hasn't copied the last byte - of the last data message in that phase yet), then MSG_MORE will be - flagged. - - -================ -CONTROL MESSAGES -================ - -AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex -calls, to invoke certain actions and to report certain conditions. These are: - - MESSAGE ID SRT DATA MEANING - ======================= === =========== =============================== - RXRPC_USER_CALL_ID sr- User ID App's call specifier - RXRPC_ABORT srt Abort code Abort code to issue/received - RXRPC_ACK -rt n/a Final ACK received - RXRPC_NET_ERROR -rt error num Network error on call - RXRPC_BUSY -rt n/a Call rejected (server busy) - RXRPC_LOCAL_ERROR -rt error num Local error encountered - RXRPC_NEW_CALL -r- n/a New call received - RXRPC_ACCEPT s-- n/a Accept new call - RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call - RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded - RXRPC_TX_LENGTH s-- data len Total length of Tx data - - (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message) - - (*) RXRPC_USER_CALL_ID - - This is used to indicate the application's call ID. It's an unsigned long - that the app specifies in the client by attaching it to the first data - message or in the server by passing it in association with an RXRPC_ACCEPT - message. recvmsg() passes it in conjunction with all messages except - those of the RXRPC_NEW_CALL message. - - (*) RXRPC_ABORT - - This is can be used by an application to abort a call by passing it to - sendmsg, or it can be delivered by recvmsg to indicate a remote abort was - received. Either way, it must be associated with an RXRPC_USER_CALL_ID to - specify the call affected. If an abort is being sent, then error EBADSLT - will be returned if there is no call with that user ID. - - (*) RXRPC_ACK - - This is delivered to a server application to indicate that the final ACK - of a call was received from the client. It will be associated with an - RXRPC_USER_CALL_ID to indicate the call that's now complete. - - (*) RXRPC_NET_ERROR - - This is delivered to an application to indicate that an ICMP error message - was encountered in the process of trying to talk to the peer. An - errno-class integer value will be included in the control message data - indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call - affected. - - (*) RXRPC_BUSY - - This is delivered to a client application to indicate that a call was - rejected by the server due to the server being busy. It will be - associated with an RXRPC_USER_CALL_ID to indicate the rejected call. - - (*) RXRPC_LOCAL_ERROR - - This is delivered to an application to indicate that a local error was - encountered and that a call has been aborted because of it. An - errno-class integer value will be included in the control message data - indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call - affected. - - (*) RXRPC_NEW_CALL - - This is delivered to indicate to a server application that a new call has - arrived and is awaiting acceptance. No user ID is associated with this, - as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT. - - (*) RXRPC_ACCEPT - - This is used by a server application to attempt to accept a call and - assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID - to indicate the user ID to be assigned. If there is no call to be - accepted (it may have timed out, been aborted, etc.), then sendmsg will - return error ENODATA. If the user ID is already in use by another call, - then error EBADSLT will be returned. - - (*) RXRPC_EXCLUSIVE_CALL - - This is used to indicate that a client call should be made on a one-off - connection. The connection is discarded once the call has terminated. - - (*) RXRPC_UPGRADE_SERVICE - - This is used to make a client call to probe if the specified service ID - may be upgraded by the server. The caller must check msg_name returned to - recvmsg() for the service ID actually in use. The operation probed must - be one that takes the same arguments in both services. - - Once this has been used to establish the upgrade capability (or lack - thereof) of the server, the service ID returned should be used for all - future communication to that server and RXRPC_UPGRADE_SERVICE should no - longer be set. - - (*) RXRPC_TX_LENGTH - - This is used to inform the kernel of the total amount of data that is - going to be transmitted by a call (whether in a client request or a - service response). If given, it allows the kernel to encrypt from the - userspace buffer directly to the packet buffers, rather than copying into - the buffer and then encrypting in place. This may only be given with the - first sendmsg() providing data for a call. EMSGSIZE will be generated if - the amount of data actually given is different. - - This takes a parameter of __s64 type that indicates how much will be - transmitted. This may not be less than zero. - -The symbol RXRPC__SUPPORTED is defined as one more than the highest control -message type supported. At run time this can be queried by means of the -RXRPC_SUPPORTED_CMSG socket option (see below). - - -============== -SOCKET OPTIONS -============== - -AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: - - (*) RXRPC_SECURITY_KEY - - This is used to specify the description of the key to be used. The key is - extracted from the calling process's keyrings with request_key() and - should be of "rxrpc" type. - - The optval pointer points to the description string, and optlen indicates - how long the string is, without the NUL terminator. - - (*) RXRPC_SECURITY_KEYRING - - Similar to above but specifies a keyring of server secret keys to use (key - type "keyring"). See the "Security" section. - - (*) RXRPC_EXCLUSIVE_CONNECTION - - This is used to request that new connections should be used for each call - made subsequently on this socket. optval should be NULL and optlen 0. - - (*) RXRPC_MIN_SECURITY_LEVEL - - This is used to specify the minimum security level required for calls on - this socket. optval must point to an int containing one of the following - values: - - (a) RXRPC_SECURITY_PLAIN - - Encrypted checksum only. - - (b) RXRPC_SECURITY_AUTH - - Encrypted checksum plus packet padded and first eight bytes of packet - encrypted - which includes the actual packet length. - - (c) RXRPC_SECURITY_ENCRYPTED - - Encrypted checksum plus entire packet padded and encrypted, including - actual packet length. - - (*) RXRPC_UPGRADEABLE_SERVICE - - This is used to indicate that a service socket with two bindings may - upgrade one bound service to the other if requested by the client. optval - must point to an array of two unsigned short ints. The first is the - service ID to upgrade from and the second the service ID to upgrade to. - - (*) RXRPC_SUPPORTED_CMSG - - This is a read-only option that writes an int into the buffer indicating - the highest control message type supported. - - -======== -SECURITY -======== - -Currently, only the kerberos 4 equivalent protocol has been implemented -(security index 2 - rxkad). This requires the rxkad module to be loaded and, -on the client, tickets of the appropriate type to be obtained from the AFS -kaserver or the kerberos server and installed as "rxrpc" type keys. This is -normally done using the klog program. An example simple klog program can be -found at: - - http://people.redhat.com/~dhowells/rxrpc/klog.c - -The payload provided to add_key() on the client should be of the following -form: - - struct rxrpc_key_sec2_v1 { - uint16_t security_index; /* 2 */ - uint16_t ticket_length; /* length of ticket[] */ - uint32_t expiry; /* time at which expires */ - uint8_t kvno; /* key version number */ - uint8_t __pad[3]; - uint8_t session_key[8]; /* DES session key */ - uint8_t ticket[0]; /* the encrypted ticket */ - }; - -Where the ticket blob is just appended to the above structure. - - -For the server, keys of type "rxrpc_s" must be made available to the server. -They have a description of ":" (eg: "52:2" for an -rxkad key for the AFS VL service). When such a key is created, it should be -given the server's secret key as the instantiation data (see the example -below). - - add_key("rxrpc_s", "52:2", secret_key, 8, keyring); - -A keyring is passed to the server socket by naming it in a sockopt. The server -socket then looks the server secret keys up in this keyring when secure -incoming connections are made. This can be seen in an example program that can -be found at: - - http://people.redhat.com/~dhowells/rxrpc/listen.c - - -==================== -EXAMPLE CLIENT USAGE -==================== - -A client would issue an operation by: - - (1) An RxRPC socket is set up by: - - client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); - - Where the third parameter indicates the protocol family of the transport - socket used - usually IPv4 but it can also be IPv6 [TODO]. - - (2) A local address can optionally be bound: - - struct sockaddr_rxrpc srx = { - .srx_family = AF_RXRPC, - .srx_service = 0, /* we're a client */ - .transport_type = SOCK_DGRAM, /* type of transport socket */ - .transport.sin_family = AF_INET, - .transport.sin_port = htons(7000), /* AFS callback */ - .transport.sin_address = 0, /* all local interfaces */ - }; - bind(client, &srx, sizeof(srx)); - - This specifies the local UDP port to be used. If not given, a random - non-privileged port will be used. A UDP port may be shared between - several unrelated RxRPC sockets. Security is handled on a basis of - per-RxRPC virtual connection. - - (3) The security is set: - - const char *key = "AFS:cambridge.redhat.com"; - setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key)); - - This issues a request_key() to get the key representing the security - context. The minimum security level can be set: - - unsigned int sec = RXRPC_SECURITY_ENCRYPTED; - setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL, - &sec, sizeof(sec)); - - (4) The server to be contacted can then be specified (alternatively this can - be done through sendmsg): - - struct sockaddr_rxrpc srx = { - .srx_family = AF_RXRPC, - .srx_service = VL_SERVICE_ID, - .transport_type = SOCK_DGRAM, /* type of transport socket */ - .transport.sin_family = AF_INET, - .transport.sin_port = htons(7005), /* AFS volume manager */ - .transport.sin_address = ..., - }; - connect(client, &srx, sizeof(srx)); - - (5) The request data should then be posted to the server socket using a series - of sendmsg() calls, each with the following control message attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - - MSG_MORE should be set in msghdr::msg_flags on all but the last part of - the request. Multiple requests may be made simultaneously. - - An RXRPC_TX_LENGTH control message can also be specified on the first - sendmsg() call. - - If a call is intended to go to a destination other than the default - specified through connect(), then msghdr::msg_name should be set on the - first request message of that call. - - (6) The reply data will then be posted to the server socket for recvmsg() to - pick up. MSG_MORE will be flagged by recvmsg() if there's more reply data - for a particular call to be read. MSG_EOR will be set on the terminal - read for a call. - - All data will be delivered with the following control message attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - - If an abort or error occurred, this will be returned in the control data - buffer instead, and MSG_EOR will be flagged to indicate the end of that - call. - -A client may ask for a service ID it knows and ask that this be upgraded to a -better service if one is available by supplying RXRPC_UPGRADE_SERVICE on the -first sendmsg() of a call. The client should then check srx_service in the -msg_name filled in by recvmsg() when collecting the result. srx_service will -hold the same value as given to sendmsg() if the upgrade request was ignored by -the service - otherwise it will be altered to indicate the service ID the -server upgraded to. Note that the upgraded service ID is chosen by the server. -The caller has to wait until it sees the service ID in the reply before sending -any more calls (further calls to the same destination will be blocked until the -probe is concluded). - - -==================== -EXAMPLE SERVER USAGE -==================== - -A server would be set up to accept operations in the following manner: - - (1) An RxRPC socket is created by: - - server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); - - Where the third parameter indicates the address type of the transport - socket used - usually IPv4. - - (2) Security is set up if desired by giving the socket a keyring with server - secret keys in it: - - keyring = add_key("keyring", "AFSkeys", NULL, 0, - KEY_SPEC_PROCESS_KEYRING); - - const char secret_key[8] = { - 0xa7, 0x83, 0x8a, 0xcb, 0xc7, 0x83, 0xec, 0x94 }; - add_key("rxrpc_s", "52:2", secret_key, 8, keyring); - - setsockopt(server, SOL_RXRPC, RXRPC_SECURITY_KEYRING, "AFSkeys", 7); - - The keyring can be manipulated after it has been given to the socket. This - permits the server to add more keys, replace keys, etc. while it is live. - - (3) A local address must then be bound: - - struct sockaddr_rxrpc srx = { - .srx_family = AF_RXRPC, - .srx_service = VL_SERVICE_ID, /* RxRPC service ID */ - .transport_type = SOCK_DGRAM, /* type of transport socket */ - .transport.sin_family = AF_INET, - .transport.sin_port = htons(7000), /* AFS callback */ - .transport.sin_address = 0, /* all local interfaces */ - }; - bind(server, &srx, sizeof(srx)); - - More than one service ID may be bound to a socket, provided the transport - parameters are the same. The limit is currently two. To do this, bind() - should be called twice. - - (4) If service upgrading is required, first two service IDs must have been - bound and then the following option must be set: - - unsigned short service_ids[2] = { from_ID, to_ID }; - setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE, - service_ids, sizeof(service_ids)); - - This will automatically upgrade connections on service from_ID to service - to_ID if they request it. This will be reflected in msg_name obtained - through recvmsg() when the request data is delivered to userspace. - - (5) The server is then set to listen out for incoming calls: - - listen(server, 100); - - (6) The kernel notifies the server of pending incoming connections by sending - it a message for each. This is received with recvmsg() on the server - socket. It has no data, and has a single dataless control message - attached: - - RXRPC_NEW_CALL - - The address that can be passed back by recvmsg() at this point should be - ignored since the call for which the message was posted may have gone by - the time it is accepted - in which case the first call still on the queue - will be accepted. - - (7) The server then accepts the new call by issuing a sendmsg() with two - pieces of control data and no actual data: - - RXRPC_ACCEPT - indicate connection acceptance - RXRPC_USER_CALL_ID - specify user ID for this call - - (8) The first request data packet will then be posted to the server socket for - recvmsg() to pick up. At that point, the RxRPC address for the call can - be read from the address fields in the msghdr struct. - - Subsequent request data will be posted to the server socket for recvmsg() - to collect as it arrives. All but the last piece of the request data will - be delivered with MSG_MORE flagged. - - All data will be delivered with the following control message attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - - (9) The reply data should then be posted to the server socket using a series - of sendmsg() calls, each with the following control messages attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - - MSG_MORE should be set in msghdr::msg_flags on all but the last message - for a particular call. - -(10) The final ACK from the client will be posted for retrieval by recvmsg() - when it is received. It will take the form of a dataless message with two - control messages attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - RXRPC_ACK - indicates final ACK (no data) - - MSG_EOR will be flagged to indicate that this is the final message for - this call. - -(11) Up to the point the final packet of reply data is sent, the call can be - aborted by calling sendmsg() with a dataless message with the following - control messages attached: - - RXRPC_USER_CALL_ID - specifies the user ID for this call - RXRPC_ABORT - indicates abort code (4 byte data) - - Any packets waiting in the socket's receive queue will be discarded if - this is issued. - -Note that all the communications for a particular service take place through -the one server socket, using control messages on sendmsg() and recvmsg() to -determine the call affected. - - -========================= -AF_RXRPC KERNEL INTERFACE -========================= - -The AF_RXRPC module also provides an interface for use by in-kernel utilities -such as the AFS filesystem. This permits such a utility to: - - (1) Use different keys directly on individual client calls on one socket - rather than having to open a whole slew of sockets, one for each key it - might want to use. - - (2) Avoid having RxRPC call request_key() at the point of issue of a call or - opening of a socket. Instead the utility is responsible for requesting a - key at the appropriate point. AFS, for instance, would do this during VFS - operations such as open() or unlink(). The key is then handed through - when the call is initiated. - - (3) Request the use of something other than GFP_KERNEL to allocate memory. - - (4) Avoid the overhead of using the recvmsg() call. RxRPC messages can be - intercepted before they get put into the socket Rx queue and the socket - buffers manipulated directly. - -To use the RxRPC facility, a kernel utility must still open an AF_RXRPC socket, -bind an address as appropriate and listen if it's to be a server socket, but -then it passes this to the kernel interface functions. - -The kernel interface functions are as follows: - - (*) Begin a new client call. - - struct rxrpc_call * - rxrpc_kernel_begin_call(struct socket *sock, - struct sockaddr_rxrpc *srx, - struct key *key, - unsigned long user_call_ID, - s64 tx_total_len, - gfp_t gfp, - rxrpc_notify_rx_t notify_rx, - bool upgrade, - bool intr, - unsigned int debug_id); - - This allocates the infrastructure to make a new RxRPC call and assigns - call and connection numbers. The call will be made on the UDP port that - the socket is bound to. The call will go to the destination address of a - connected client socket unless an alternative is supplied (srx is - non-NULL). - - If a key is supplied then this will be used to secure the call instead of - the key bound to the socket with the RXRPC_SECURITY_KEY sockopt. Calls - secured in this way will still share connections if at all possible. - - The user_call_ID is equivalent to that supplied to sendmsg() in the - control data buffer. It is entirely feasible to use this to point to a - kernel data structure. - - tx_total_len is the amount of data the caller is intending to transmit - with this call (or -1 if unknown at this point). Setting the data size - allows the kernel to encrypt directly to the packet buffers, thereby - saving a copy. The value may not be less than -1. - - notify_rx is a pointer to a function to be called when events such as - incoming data packets or remote aborts happen. - - upgrade should be set to true if a client operation should request that - the server upgrade the service to a better one. The resultant service ID - is returned by rxrpc_kernel_recv_data(). - - intr should be set to true if the call should be interruptible. If this - is not set, this function may not return until a channel has been - allocated; if it is set, the function may return -ERESTARTSYS. - - debug_id is the call debugging ID to be used for tracing. This can be - obtained by atomically incrementing rxrpc_debug_id. - - If this function is successful, an opaque reference to the RxRPC call is - returned. The caller now holds a reference on this and it must be - properly ended. - - (*) End a client call. - - void rxrpc_kernel_end_call(struct socket *sock, - struct rxrpc_call *call); - - This is used to end a previously begun call. The user_call_ID is expunged - from AF_RXRPC's knowledge and will not be seen again in association with - the specified call. - - (*) Send data through a call. - - typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk, - unsigned long user_call_ID, - struct sk_buff *skb); - - int rxrpc_kernel_send_data(struct socket *sock, - struct rxrpc_call *call, - struct msghdr *msg, - size_t len, - rxrpc_notify_end_tx_t notify_end_rx); - - This is used to supply either the request part of a client call or the - reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the - data buffers to be used. msg_iov may not be NULL and must point - exclusively to in-kernel virtual addresses. msg.msg_flags may be given - MSG_MORE if there will be subsequent data sends for this call. - - The msg must not specify a destination address, control data or any flags - other than MSG_MORE. len is the total amount of data to transmit. - - notify_end_rx can be NULL or it can be used to specify a function to be - called when the call changes state to end the Tx phase. This function is - called with the call-state spinlock held to prevent any reply or final ACK - from being delivered first. - - (*) Receive data from a call. - - int rxrpc_kernel_recv_data(struct socket *sock, - struct rxrpc_call *call, - void *buf, - size_t size, - size_t *_offset, - bool want_more, - u32 *_abort, - u16 *_service) - - This is used to receive data from either the reply part of a client call - or the request part of a service call. buf and size specify how much - data is desired and where to store it. *_offset is added on to buf and - subtracted from size internally; the amount copied into the buffer is - added to *_offset before returning. - - want_more should be true if further data will be required after this is - satisfied and false if this is the last item of the receive phase. - - There are three normal returns: 0 if the buffer was filled and want_more - was true; 1 if the buffer was filled, the last DATA packet has been - emptied and want_more was false; and -EAGAIN if the function needs to be - called again. - - If the last DATA packet is processed but the buffer contains less than - the amount requested, EBADMSG is returned. If want_more wasn't set, but - more data was available, EMSGSIZE is returned. - - If a remote ABORT is detected, the abort code received will be stored in - *_abort and ECONNABORTED will be returned. - - The service ID that the call ended up with is returned into *_service. - This can be used to see if a call got a service upgrade. - - (*) Abort a call. - - void rxrpc_kernel_abort_call(struct socket *sock, - struct rxrpc_call *call, - u32 abort_code); - - This is used to abort a call if it's still in an abortable state. The - abort code specified will be placed in the ABORT message sent. - - (*) Intercept received RxRPC messages. - - typedef void (*rxrpc_interceptor_t)(struct sock *sk, - unsigned long user_call_ID, - struct sk_buff *skb); - - void - rxrpc_kernel_intercept_rx_messages(struct socket *sock, - rxrpc_interceptor_t interceptor); - - This installs an interceptor function on the specified AF_RXRPC socket. - All messages that would otherwise wind up in the socket's Rx queue are - then diverted to this function. Note that care must be taken to process - the messages in the right order to maintain DATA message sequentiality. - - The interceptor function itself is provided with the address of the socket - and handling the incoming message, the ID assigned by the kernel utility - to the call and the socket buffer containing the message. - - The skb->mark field indicates the type of message: - - MARK MEANING - =============================== ======================================= - RXRPC_SKB_MARK_DATA Data message - RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call - RXRPC_SKB_MARK_BUSY Client call rejected as server busy - RXRPC_SKB_MARK_REMOTE_ABORT Call aborted by peer - RXRPC_SKB_MARK_NET_ERROR Network error detected - RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered - RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance - - The remote abort message can be probed with rxrpc_kernel_get_abort_code(). - The two error messages can be probed with rxrpc_kernel_get_error_number(). - A new call can be accepted with rxrpc_kernel_accept_call(). - - Data messages can have their contents extracted with the usual bunch of - socket buffer manipulation functions. A data message can be determined to - be the last one in a sequence with rxrpc_kernel_is_data_last(). When a - data message has been used up, rxrpc_kernel_data_consumed() should be - called on it. - - Messages should be handled to rxrpc_kernel_free_skb() to dispose of. It - is possible to get extra refs on all types of message for later freeing, - but this may pin the state of a call until the message is finally freed. - - (*) Accept an incoming call. - - struct rxrpc_call * - rxrpc_kernel_accept_call(struct socket *sock, - unsigned long user_call_ID); - - This is used to accept an incoming call and to assign it a call ID. This - function is similar to rxrpc_kernel_begin_call() and calls accepted must - be ended in the same way. - - If this function is successful, an opaque reference to the RxRPC call is - returned. The caller now holds a reference on this and it must be - properly ended. - - (*) Reject an incoming call. - - int rxrpc_kernel_reject_call(struct socket *sock); - - This is used to reject the first incoming call on the socket's queue with - a BUSY message. -ENODATA is returned if there were no incoming calls. - Other errors may be returned if the call had been aborted (-ECONNABORTED) - or had timed out (-ETIME). - - (*) Allocate a null key for doing anonymous security. - - struct key *rxrpc_get_null_key(const char *keyname); - - This is used to allocate a null RxRPC key that can be used to indicate - anonymous security for a particular domain. - - (*) Get the peer address of a call. - - void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call, - struct sockaddr_rxrpc *_srx); - - This is used to find the remote peer address of a call. - - (*) Set the total transmit data size on a call. - - void rxrpc_kernel_set_tx_length(struct socket *sock, - struct rxrpc_call *call, - s64 tx_total_len); - - This sets the amount of data that the caller is intending to transmit on a - call. It's intended to be used for setting the reply size as the request - size should be set when the call is begun. tx_total_len may not be less - than zero. - - (*) Get call RTT. - - u64 rxrpc_kernel_get_rtt(struct socket *sock, struct rxrpc_call *call); - - Get the RTT time to the peer in use by a call. The value returned is in - nanoseconds. - - (*) Check call still alive. - - bool rxrpc_kernel_check_life(struct socket *sock, - struct rxrpc_call *call, - u32 *_life); - void rxrpc_kernel_probe_life(struct socket *sock, - struct rxrpc_call *call); - - The first function passes back in *_life a number that is updated when - ACKs are received from the peer (notably including PING RESPONSE ACKs - which we can elicit by sending PING ACKs to see if the call still exists - on the server). The caller should compare the numbers of two calls to see - if the call is still alive after waiting for a suitable interval. It also - returns true as long as the call hasn't yet reached the completed state. - - This allows the caller to work out if the server is still contactable and - if the call is still alive on the server while waiting for the server to - process a client operation. - - The second function causes a ping ACK to be transmitted to try to provoke - the peer into responding, which would then cause the value returned by the - first function to change. Note that this must be called in TASK_RUNNING - state. - - (*) Get reply timestamp. - - bool rxrpc_kernel_get_reply_time(struct socket *sock, - struct rxrpc_call *call, - ktime_t *_ts) - - This allows the timestamp on the first DATA packet of the reply of a - client call to be queried, provided that it is still in the Rx ring. If - successful, the timestamp will be stored into *_ts and true will be - returned; false will be returned otherwise. - - (*) Get remote client epoch. - - u32 rxrpc_kernel_get_epoch(struct socket *sock, - struct rxrpc_call *call) - - This allows the epoch that's contained in packets of an incoming client - call to be queried. This value is returned. The function always - successful if the call is still in progress. It shouldn't be called once - the call has expired. Note that calling this on a local client call only - returns the local epoch. - - This value can be used to determine if the remote client has been - restarted as it shouldn't change otherwise. - - (*) Set the maxmimum lifespan on a call. - - void rxrpc_kernel_set_max_life(struct socket *sock, - struct rxrpc_call *call, - unsigned long hard_timeout) - - This sets the maximum lifespan on a call to hard_timeout (which is in - jiffies). In the event of the timeout occurring, the call will be - aborted and -ETIME or -ETIMEDOUT will be returned. - - -======================= -CONFIGURABLE PARAMETERS -======================= - -The RxRPC protocol driver has a number of configurable parameters that can be -adjusted through sysctls in /proc/net/rxrpc/: - - (*) req_ack_delay - - The amount of time in milliseconds after receiving a packet with the - request-ack flag set before we honour the flag and actually send the - requested ack. - - Usually the other side won't stop sending packets until the advertised - reception window is full (to a maximum of 255 packets), so delaying the - ACK permits several packets to be ACK'd in one go. - - (*) soft_ack_delay - - The amount of time in milliseconds after receiving a new packet before we - generate a soft-ACK to tell the sender that it doesn't need to resend. - - (*) idle_ack_delay - - The amount of time in milliseconds after all the packets currently in the - received queue have been consumed before we generate a hard-ACK to tell - the sender it can free its buffers, assuming no other reason occurs that - we would send an ACK. - - (*) resend_timeout - - The amount of time in milliseconds after transmitting a packet before we - transmit it again, assuming no ACK is received from the receiver telling - us they got it. - - (*) max_call_lifetime - - The maximum amount of time in seconds that a call may be in progress - before we preemptively kill it. - - (*) dead_call_expiry - - The amount of time in seconds before we remove a dead call from the call - list. Dead calls are kept around for a little while for the purpose of - repeating ACK and ABORT packets. - - (*) connection_expiry - - The amount of time in seconds after a connection was last used before we - remove it from the connection list. While a connection is in existence, - it serves as a placeholder for negotiated security; when it is deleted, - the security must be renegotiated. - - (*) transport_expiry - - The amount of time in seconds after a transport was last used before we - remove it from the transport list. While a transport is in existence, it - serves to anchor the peer data and keeps the connection ID counter. - - (*) rxrpc_rx_window_size - - The size of the receive window in packets. This is the maximum number of - unconsumed received packets we're willing to hold in memory for any - particular call. - - (*) rxrpc_rx_mtu - - The maximum packet MTU size that we're willing to receive in bytes. This - indicates to the peer whether we're willing to accept jumbo packets. - - (*) rxrpc_rx_jumbo_max - - The maximum number of packets that we're willing to accept in a jumbo - packet. Non-terminal packets in a jumbo packet must contain a four byte - header plus exactly 1412 bytes of data. The terminal packet must contain - a four byte header plus any amount of data. In any event, a jumbo packet - may not exceed rxrpc_rx_mtu in size. diff --git a/MAINTAINERS b/MAINTAINERS index b28823ab48c5..866a0dcd66ef 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14593,7 +14593,7 @@ M: David Howells L: linux-afs@lists.infradead.org S: Supported W: https://www.infradead.org/~dhowells/kafs/ -F: Documentation/networking/rxrpc.txt +F: Documentation/networking/rxrpc.rst F: include/keys/rxrpc-type.h F: include/net/af_rxrpc.h F: include/trace/events/rxrpc.h diff --git a/net/rxrpc/Kconfig b/net/rxrpc/Kconfig index 57ebb29c26ad..d706bb408365 100644 --- a/net/rxrpc/Kconfig +++ b/net/rxrpc/Kconfig @@ -18,7 +18,7 @@ config AF_RXRPC This module at the moment only supports client operations and is currently incomplete. - See Documentation/networking/rxrpc.txt. + See Documentation/networking/rxrpc.rst. config AF_RXRPC_IPV6 bool "IPv6 support for RxRPC" @@ -41,7 +41,7 @@ config AF_RXRPC_DEBUG help Say Y here to make runtime controllable debugging messages appear. - See Documentation/networking/rxrpc.txt. + See Documentation/networking/rxrpc.rst. config RXKAD @@ -56,4 +56,4 @@ config RXKAD Provide kerberos 4 and AFS kaserver security handling for AF_RXRPC through the use of the key retention service. - See Documentation/networking/rxrpc.txt. + See Documentation/networking/rxrpc.rst. diff --git a/net/rxrpc/sysctl.c b/net/rxrpc/sysctl.c index 2bbb38161851..174e903e18de 100644 --- a/net/rxrpc/sysctl.c +++ b/net/rxrpc/sysctl.c @@ -21,7 +21,7 @@ static const unsigned long max_jiffies = MAX_JIFFY_OFFSET; /* * RxRPC operating parameters. * - * See Documentation/networking/rxrpc.txt and the variable definitions for more + * See Documentation/networking/rxrpc.rst and the variable definitions for more * information on the individual parameters. */ static struct ctl_table rxrpc_sysctl_table[] = { -- cgit v1.2.3 From 671d114d8cde3ba4390714b850c86d8b39d31009 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:22 +0200 Subject: docs: networking: convert sctp.txt to ReST - add SPDX header; - add a document title; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: Marcelo Ricardo Leitner Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/sctp.rst | 42 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/sctp.txt | 35 ------------------------------- MAINTAINERS | 2 +- 4 files changed, 44 insertions(+), 36 deletions(-) create mode 100644 Documentation/networking/sctp.rst delete mode 100644 Documentation/networking/sctp.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index cd307b9601fa..1761eb715061 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -100,6 +100,7 @@ Contents: rds regulatory rxrpc + sctp .. only:: subproject and html diff --git a/Documentation/networking/sctp.rst b/Documentation/networking/sctp.rst new file mode 100644 index 000000000000..9f4d9c8a925b --- /dev/null +++ b/Documentation/networking/sctp.rst @@ -0,0 +1,42 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Linux Kernel SCTP +================= + +This is the current BETA release of the Linux Kernel SCTP reference +implementation. + +SCTP (Stream Control Transmission Protocol) is a IP based, message oriented, +reliable transport protocol, with congestion control, support for +transparent multi-homing, and multiple ordered streams of messages. +RFC2960 defines the core protocol. The IETF SIGTRAN working group originally +developed the SCTP protocol and later handed the protocol over to the +Transport Area (TSVWG) working group for the continued evolvement of SCTP as a +general purpose transport. + +See the IETF website (http://www.ietf.org) for further documents on SCTP. +See http://www.ietf.org/rfc/rfc2960.txt + +The initial project goal is to create an Linux kernel reference implementation +of SCTP that is RFC 2960 compliant and provides an programming interface +referred to as the UDP-style API of the Sockets Extensions for SCTP, as +proposed in IETF Internet-Drafts. + +Caveats +======= + +- lksctp can be built as statically or as a module. However, be aware that + module removal of lksctp is not yet a safe activity. + +- There is tentative support for IPv6, but most work has gone towards + implementation and testing lksctp on IPv4. + + +For more information, please visit the lksctp project website: + + http://www.sf.net/projects/lksctp + +Or contact the lksctp developers through the mailing list: + + diff --git a/Documentation/networking/sctp.txt b/Documentation/networking/sctp.txt deleted file mode 100644 index 97b810ca9082..000000000000 --- a/Documentation/networking/sctp.txt +++ /dev/null @@ -1,35 +0,0 @@ -Linux Kernel SCTP - -This is the current BETA release of the Linux Kernel SCTP reference -implementation. - -SCTP (Stream Control Transmission Protocol) is a IP based, message oriented, -reliable transport protocol, with congestion control, support for -transparent multi-homing, and multiple ordered streams of messages. -RFC2960 defines the core protocol. The IETF SIGTRAN working group originally -developed the SCTP protocol and later handed the protocol over to the -Transport Area (TSVWG) working group for the continued evolvement of SCTP as a -general purpose transport. - -See the IETF website (http://www.ietf.org) for further documents on SCTP. -See http://www.ietf.org/rfc/rfc2960.txt - -The initial project goal is to create an Linux kernel reference implementation -of SCTP that is RFC 2960 compliant and provides an programming interface -referred to as the UDP-style API of the Sockets Extensions for SCTP, as -proposed in IETF Internet-Drafts. - -Caveats: - --lksctp can be built as statically or as a module. However, be aware that -module removal of lksctp is not yet a safe activity. - --There is tentative support for IPv6, but most work has gone towards -implementation and testing lksctp on IPv4. - - -For more information, please visit the lksctp project website: - http://www.sf.net/projects/lksctp - -Or contact the lksctp developers through the mailing list: - diff --git a/MAINTAINERS b/MAINTAINERS index 866a0dcd66ef..0ac9cec0bce6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14999,7 +14999,7 @@ M: Marcelo Ricardo Leitner L: linux-sctp@vger.kernel.org S: Maintained W: http://lksctp.sourceforge.net -F: Documentation/networking/sctp.txt +F: Documentation/networking/sctp.rst F: include/linux/sctp.h F: include/net/sctp/ F: include/uapi/linux/sctp.h -- cgit v1.2.3 From de1fd4a7b0f2351b775b673f092430dff64b221e Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:23 +0200 Subject: docs: networking: convert secid.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/secid.rst | 20 ++++++++++++++++++++ Documentation/networking/secid.txt | 14 -------------- 3 files changed, 21 insertions(+), 14 deletions(-) create mode 100644 Documentation/networking/secid.rst delete mode 100644 Documentation/networking/secid.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 1761eb715061..8b672f252f67 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -101,6 +101,7 @@ Contents: regulatory rxrpc sctp + secid .. only:: subproject and html diff --git a/Documentation/networking/secid.rst b/Documentation/networking/secid.rst new file mode 100644 index 000000000000..b45141a98027 --- /dev/null +++ b/Documentation/networking/secid.rst @@ -0,0 +1,20 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +LSM/SeLinux secid +================= + +flowi structure: + +The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate +the label of the flow. This label of the flow is currently used in selecting +matching labeled xfrm(s). + +If this is an outbound flow, the label is derived from the socket, if any, or +the incoming packet this flow is being generated as a response to (e.g. tcp +resets, timewait ack, etc.). It is also conceivable that the label could be +derived from other sources such as process context, device, etc., in special +cases, as may be appropriate. + +If this is an inbound flow, the label is derived from the IPSec security +associations, if any, used by the packet. diff --git a/Documentation/networking/secid.txt b/Documentation/networking/secid.txt deleted file mode 100644 index 95ea06784333..000000000000 --- a/Documentation/networking/secid.txt +++ /dev/null @@ -1,14 +0,0 @@ -flowi structure: - -The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate -the label of the flow. This label of the flow is currently used in selecting -matching labeled xfrm(s). - -If this is an outbound flow, the label is derived from the socket, if any, or -the incoming packet this flow is being generated as a response to (e.g. tcp -resets, timewait ack, etc.). It is also conceivable that the label could be -derived from other sources such as process context, device, etc., in special -cases, as may be appropriate. - -If this is an inbound flow, the label is derived from the IPSec security -associations, if any, used by the packet. -- cgit v1.2.3 From d6c48bc6f8da38590946d23a27138d5258aca261 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:24 +0200 Subject: docs: networking: convert seg6-sysctl.txt to ReST - add SPDX header; - mark code blocks and literals as such; - add a document title; - adjust chapters, adding proper markups; - mark lists as such; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/seg6-sysctl.rst | 26 ++++++++++++++++++++++++++ Documentation/networking/seg6-sysctl.txt | 18 ------------------ 3 files changed, 27 insertions(+), 18 deletions(-) create mode 100644 Documentation/networking/seg6-sysctl.rst delete mode 100644 Documentation/networking/seg6-sysctl.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 8b672f252f67..716744c568b7 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -102,6 +102,7 @@ Contents: rxrpc sctp secid + seg6-sysctl .. only:: subproject and html diff --git a/Documentation/networking/seg6-sysctl.rst b/Documentation/networking/seg6-sysctl.rst new file mode 100644 index 000000000000..ec73e1445030 --- /dev/null +++ b/Documentation/networking/seg6-sysctl.rst @@ -0,0 +1,26 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Seg6 Sysfs variables +==================== + + +/proc/sys/net/conf//seg6_* variables: +============================================ + +seg6_enabled - BOOL + Accept or drop SR-enabled IPv6 packets on this interface. + + Relevant packets are those with SRH present and DA = local. + + * 0 - disabled (default) + * not 0 - enabled + +seg6_require_hmac - INTEGER + Define HMAC policy for ingress SR-enabled packets on this interface. + + * -1 - Ignore HMAC field + * 0 - Accept SR packets without HMAC, validate SR packets with HMAC + * 1 - Drop SR packets without HMAC, validate SR packets with HMAC + + Default is 0. diff --git a/Documentation/networking/seg6-sysctl.txt b/Documentation/networking/seg6-sysctl.txt deleted file mode 100644 index bdbde23b19cb..000000000000 --- a/Documentation/networking/seg6-sysctl.txt +++ /dev/null @@ -1,18 +0,0 @@ -/proc/sys/net/conf//seg6_* variables: - -seg6_enabled - BOOL - Accept or drop SR-enabled IPv6 packets on this interface. - - Relevant packets are those with SRH present and DA = local. - - 0 - disabled (default) - not 0 - enabled - -seg6_require_hmac - INTEGER - Define HMAC policy for ingress SR-enabled packets on this interface. - - -1 - Ignore HMAC field - 0 - Accept SR packets without HMAC, validate SR packets with HMAC - 1 - Drop SR packets without HMAC, validate SR packets with HMAC - - Default is 0. -- cgit v1.2.3 From fe3dfe418cbbd1ee168d9294725d482029097694 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:25 +0200 Subject: docs: networking: convert skfp.txt to ReST - add SPDX header; - use copyright symbol; - add a document title; - adjust titles and chapters, adding proper markups; - comment out text-only TOC from html/pdf output; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/skfp.rst | 253 +++++++++++++++++++++++++++++++++++++ Documentation/networking/skfp.txt | 220 -------------------------------- drivers/net/fddi/Kconfig | 2 +- 4 files changed, 255 insertions(+), 221 deletions(-) create mode 100644 Documentation/networking/skfp.rst delete mode 100644 Documentation/networking/skfp.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 716744c568b7..d19ddcbe66e5 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -103,6 +103,7 @@ Contents: sctp secid seg6-sysctl + skfp .. only:: subproject and html diff --git a/Documentation/networking/skfp.rst b/Documentation/networking/skfp.rst new file mode 100644 index 000000000000..58f548105c1d --- /dev/null +++ b/Documentation/networking/skfp.rst @@ -0,0 +1,253 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: + +======================== +SysKonnect driver - SKFP +======================== + +|copy| Copyright 1998-2000 SysKonnect, + +skfp.txt created 11-May-2000 + +Readme File for skfp.o v2.06 + + +.. This file contains + + (1) OVERVIEW + (2) SUPPORTED ADAPTERS + (3) GENERAL INFORMATION + (4) INSTALLATION + (5) INCLUSION OF THE ADAPTER IN SYSTEM START + (6) TROUBLESHOOTING + (7) FUNCTION OF THE ADAPTER LEDS + (8) HISTORY + + +1. Overview +=========== + +This README explains how to use the driver 'skfp' for Linux with your +network adapter. + +Chapter 2: Contains a list of all network adapters that are supported by +this driver. + +Chapter 3: + Gives some general information. + +Chapter 4: Describes common problems and solutions. + +Chapter 5: Shows the changed functionality of the adapter LEDs. + +Chapter 6: History of development. + + +2. Supported adapters +===================== + +The network driver 'skfp' supports the following network adapters: +SysKonnect adapters: + + - SK-5521 (SK-NET FDDI-UP) + - SK-5522 (SK-NET FDDI-UP DAS) + - SK-5541 (SK-NET FDDI-FP) + - SK-5543 (SK-NET FDDI-LP) + - SK-5544 (SK-NET FDDI-LP DAS) + - SK-5821 (SK-NET FDDI-UP64) + - SK-5822 (SK-NET FDDI-UP64 DAS) + - SK-5841 (SK-NET FDDI-FP64) + - SK-5843 (SK-NET FDDI-LP64) + - SK-5844 (SK-NET FDDI-LP64 DAS) + +Compaq adapters (not tested): + + - Netelligent 100 FDDI DAS Fibre SC + - Netelligent 100 FDDI SAS Fibre SC + - Netelligent 100 FDDI DAS UTP + - Netelligent 100 FDDI SAS UTP + - Netelligent 100 FDDI SAS Fibre MIC + + +3. General Information +====================== + +From v2.01 on, the driver is integrated in the linux kernel sources. +Therefore, the installation is the same as for any other adapter +supported by the kernel. + +Refer to the manual of your distribution about the installation +of network adapters. + +Makes my life much easier :-) + +4. Troubleshooting +================== + +If you run into problems during installation, check those items: + +Problem: + The FDDI adapter cannot be found by the driver. + +Reason: + Look in /proc/pci for the following entry: + + 'FDDI network controller: SysKonnect SK-FDDI-PCI ...' + + If this entry exists, then the FDDI adapter has been + found by the system and should be able to be used. + + If this entry does not exist or if the file '/proc/pci' + is not there, then you may have a hardware problem or PCI + support may not be enabled in your kernel. + + The adapter can be checked using the diagnostic program + which is available from the SysKonnect web site: + + www.syskonnect.de + + Some COMPAQ machines have a problem with PCI under + Linux. This is described in the 'PCI howto' document + (included in some distributions or available from the + www, e.g. at 'www.linux.org') and no workaround is available. + +Problem: + You want to use your computer as a router between + multiple IP subnetworks (using multiple adapters), but + you cannot reach computers in other subnetworks. + +Reason: + Either the router's kernel is not configured for IP + forwarding or there is a problem with the routing table + and gateway configuration in at least one of the + computers. + +If your problem is not listed here, please contact our +technical support for help. + +You can send email to: linux@syskonnect.de + +When contacting our technical support, +please ensure that the following information is available: + +- System Manufacturer and Model +- Boards in your system +- Distribution +- Kernel version + + +5. Function of the Adapter LEDs +=============================== + + The functionality of the LED's on the FDDI network adapters was + changed in SMT version v2.82. With this new SMT version, the yellow + LED works as a ring operational indicator. An active yellow LED + indicates that the ring is down. The green LED on the adapter now + works as a link indicator where an active GREEN LED indicates that + the respective port has a physical connection. + + With versions of SMT prior to v2.82 a ring up was indicated if the + yellow LED was off while the green LED(s) showed the connection + status of the adapter. During a ring down the green LED was off and + the yellow LED was on. + + All implementations indicate that a driver is not loaded if + all LEDs are off. + + +6. History +========== + +v2.06 (20000511) (In-Kernel version) + New features: + + - 64 bit support + - new pci dma interface + - in kernel 2.3.99 + +v2.05 (20000217) (In-Kernel version) + New features: + + - Changes for 2.3.45 kernel + +v2.04 (20000207) (Standalone version) + New features: + + - Added rx/tx byte counter + +v2.03 (20000111) (Standalone version) + Problems fixed: + + - Fixed printk statements from v2.02 + +v2.02 (991215) (Standalone version) + Problems fixed: + + - Removed unnecessary output + - Fixed path for "printver.sh" in makefile + +v2.01 (991122) (In-Kernel version) + New features: + + - Integration in Linux kernel sources + - Support for memory mapped I/O. + +v2.00 (991112) + New features: + + - Full source released under GPL + +v1.05 (991023) + Problems fixed: + + - Compilation with kernel version 2.2.13 failed + +v1.04 (990427) + Changes: + + - New SMT module included, changing LED functionality + + Problems fixed: + + - Synchronization on SMP machines was buggy + +v1.03 (990325) + Problems fixed: + + - Interrupt routing on SMP machines could be incorrect + +v1.02 (990310) + New features: + + - Support for kernel versions 2.2.x added + - Kernel patch instead of private duplicate of kernel functions + +v1.01 (980812) + Problems fixed: + + Connection hangup with telnet + Slow telnet connection + +v1.00 beta 01 (980507) + New features: + + None. + + Problems fixed: + + None. + + Known limitations: + + - tar archive instead of standard package format (rpm). + - FDDI statistic is empty. + - not tested with 2.1.xx kernels + - integration in kernel not tested + - not tested simultaneously with FDDI adapters from other vendors. + - only X86 processors supported. + - SBA (Synchronous Bandwidth Allocator) parameters can + not be configured. + - does not work on some COMPAQ machines. See the PCI howto + document for details about this problem. + - data corruption with kernel versions below 2.0.33. diff --git a/Documentation/networking/skfp.txt b/Documentation/networking/skfp.txt deleted file mode 100644 index 203ec66c9fb4..000000000000 --- a/Documentation/networking/skfp.txt +++ /dev/null @@ -1,220 +0,0 @@ -(C)Copyright 1998-2000 SysKonnect, -=========================================================================== - -skfp.txt created 11-May-2000 - -Readme File for skfp.o v2.06 - - -This file contains -(1) OVERVIEW -(2) SUPPORTED ADAPTERS -(3) GENERAL INFORMATION -(4) INSTALLATION -(5) INCLUSION OF THE ADAPTER IN SYSTEM START -(6) TROUBLESHOOTING -(7) FUNCTION OF THE ADAPTER LEDS -(8) HISTORY - -=========================================================================== - - - -(1) OVERVIEW -============ - -This README explains how to use the driver 'skfp' for Linux with your -network adapter. - -Chapter 2: Contains a list of all network adapters that are supported by - this driver. - -Chapter 3: Gives some general information. - -Chapter 4: Describes common problems and solutions. - -Chapter 5: Shows the changed functionality of the adapter LEDs. - -Chapter 6: History of development. - -*** - - -(2) SUPPORTED ADAPTERS -====================== - -The network driver 'skfp' supports the following network adapters: -SysKonnect adapters: - - SK-5521 (SK-NET FDDI-UP) - - SK-5522 (SK-NET FDDI-UP DAS) - - SK-5541 (SK-NET FDDI-FP) - - SK-5543 (SK-NET FDDI-LP) - - SK-5544 (SK-NET FDDI-LP DAS) - - SK-5821 (SK-NET FDDI-UP64) - - SK-5822 (SK-NET FDDI-UP64 DAS) - - SK-5841 (SK-NET FDDI-FP64) - - SK-5843 (SK-NET FDDI-LP64) - - SK-5844 (SK-NET FDDI-LP64 DAS) -Compaq adapters (not tested): - - Netelligent 100 FDDI DAS Fibre SC - - Netelligent 100 FDDI SAS Fibre SC - - Netelligent 100 FDDI DAS UTP - - Netelligent 100 FDDI SAS UTP - - Netelligent 100 FDDI SAS Fibre MIC -*** - - -(3) GENERAL INFORMATION -======================= - -From v2.01 on, the driver is integrated in the linux kernel sources. -Therefore, the installation is the same as for any other adapter -supported by the kernel. -Refer to the manual of your distribution about the installation -of network adapters. -Makes my life much easier :-) -*** - - -(4) TROUBLESHOOTING -=================== - -If you run into problems during installation, check those items: - -Problem: The FDDI adapter cannot be found by the driver. -Reason: Look in /proc/pci for the following entry: - 'FDDI network controller: SysKonnect SK-FDDI-PCI ...' - If this entry exists, then the FDDI adapter has been - found by the system and should be able to be used. - If this entry does not exist or if the file '/proc/pci' - is not there, then you may have a hardware problem or PCI - support may not be enabled in your kernel. - The adapter can be checked using the diagnostic program - which is available from the SysKonnect web site: - www.syskonnect.de - Some COMPAQ machines have a problem with PCI under - Linux. This is described in the 'PCI howto' document - (included in some distributions or available from the - www, e.g. at 'www.linux.org') and no workaround is available. - -Problem: You want to use your computer as a router between - multiple IP subnetworks (using multiple adapters), but - you cannot reach computers in other subnetworks. -Reason: Either the router's kernel is not configured for IP - forwarding or there is a problem with the routing table - and gateway configuration in at least one of the - computers. - -If your problem is not listed here, please contact our -technical support for help. -You can send email to: - linux@syskonnect.de -When contacting our technical support, -please ensure that the following information is available: -- System Manufacturer and Model -- Boards in your system -- Distribution -- Kernel version - -*** - - -(5) FUNCTION OF THE ADAPTER LEDS -================================ - - The functionality of the LED's on the FDDI network adapters was - changed in SMT version v2.82. With this new SMT version, the yellow - LED works as a ring operational indicator. An active yellow LED - indicates that the ring is down. The green LED on the adapter now - works as a link indicator where an active GREEN LED indicates that - the respective port has a physical connection. - - With versions of SMT prior to v2.82 a ring up was indicated if the - yellow LED was off while the green LED(s) showed the connection - status of the adapter. During a ring down the green LED was off and - the yellow LED was on. - - All implementations indicate that a driver is not loaded if - all LEDs are off. - -*** - - -(6) HISTORY -=========== - -v2.06 (20000511) (In-Kernel version) - New features: - - 64 bit support - - new pci dma interface - - in kernel 2.3.99 - -v2.05 (20000217) (In-Kernel version) - New features: - - Changes for 2.3.45 kernel - -v2.04 (20000207) (Standalone version) - New features: - - Added rx/tx byte counter - -v2.03 (20000111) (Standalone version) - Problems fixed: - - Fixed printk statements from v2.02 - -v2.02 (991215) (Standalone version) - Problems fixed: - - Removed unnecessary output - - Fixed path for "printver.sh" in makefile - -v2.01 (991122) (In-Kernel version) - New features: - - Integration in Linux kernel sources - - Support for memory mapped I/O. - -v2.00 (991112) - New features: - - Full source released under GPL - -v1.05 (991023) - Problems fixed: - - Compilation with kernel version 2.2.13 failed - -v1.04 (990427) - Changes: - - New SMT module included, changing LED functionality - Problems fixed: - - Synchronization on SMP machines was buggy - -v1.03 (990325) - Problems fixed: - - Interrupt routing on SMP machines could be incorrect - -v1.02 (990310) - New features: - - Support for kernel versions 2.2.x added - - Kernel patch instead of private duplicate of kernel functions - -v1.01 (980812) - Problems fixed: - Connection hangup with telnet - Slow telnet connection - -v1.00 beta 01 (980507) - New features: - None. - Problems fixed: - None. - Known limitations: - - tar archive instead of standard package format (rpm). - - FDDI statistic is empty. - - not tested with 2.1.xx kernels - - integration in kernel not tested - - not tested simultaneously with FDDI adapters from other vendors. - - only X86 processors supported. - - SBA (Synchronous Bandwidth Allocator) parameters can - not be configured. - - does not work on some COMPAQ machines. See the PCI howto - document for details about this problem. - - data corruption with kernel versions below 2.0.33. - -*** End of information file *** diff --git a/drivers/net/fddi/Kconfig b/drivers/net/fddi/Kconfig index 3b412a56f2cb..da4f58eed08f 100644 --- a/drivers/net/fddi/Kconfig +++ b/drivers/net/fddi/Kconfig @@ -77,7 +77,7 @@ config SKFP - Netelligent 100 FDDI SAS UTP - Netelligent 100 FDDI SAS Fibre MIC - Read for information about + Read for information about the driver. Questions concerning this driver can be addressed to: -- cgit v1.2.3 From 060d9d3e1282e3ccf4bbb3cc4ea94bed3ce69310 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:26 +0200 Subject: docs: networking: convert strparser.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/strparser.rst | 240 +++++++++++++++++++++++++++++++++ Documentation/networking/strparser.txt | 207 ---------------------------- 3 files changed, 241 insertions(+), 207 deletions(-) create mode 100644 Documentation/networking/strparser.rst delete mode 100644 Documentation/networking/strparser.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index d19ddcbe66e5..e5a705024c6a 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -104,6 +104,7 @@ Contents: secid seg6-sysctl skfp + strparser .. only:: subproject and html diff --git a/Documentation/networking/strparser.rst b/Documentation/networking/strparser.rst new file mode 100644 index 000000000000..6cab1f74ae05 --- /dev/null +++ b/Documentation/networking/strparser.rst @@ -0,0 +1,240 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +Stream Parser (strparser) +========================= + +Introduction +============ + +The stream parser (strparser) is a utility that parses messages of an +application layer protocol running over a data stream. The stream +parser works in conjunction with an upper layer in the kernel to provide +kernel support for application layer messages. For instance, Kernel +Connection Multiplexor (KCM) uses the Stream Parser to parse messages +using a BPF program. + +The strparser works in one of two modes: receive callback or general +mode. + +In receive callback mode, the strparser is called from the data_ready +callback of a TCP socket. Messages are parsed and delivered as they are +received on the socket. + +In general mode, a sequence of skbs are fed to strparser from an +outside source. Message are parsed and delivered as the sequence is +processed. This modes allows strparser to be applied to arbitrary +streams of data. + +Interface +========= + +The API includes a context structure, a set of callbacks, utility +functions, and a data_ready function for receive callback mode. The +callbacks include a parse_msg function that is called to perform +parsing (e.g. BPF parsing in case of KCM), and a rcv_msg function +that is called when a full message has been completed. + +Functions +========= + + :: + + strp_init(struct strparser *strp, struct sock *sk, + const struct strp_callbacks *cb) + + Called to initialize a stream parser. strp is a struct of type + strparser that is allocated by the upper layer. sk is the TCP + socket associated with the stream parser for use with receive + callback mode; in general mode this is set to NULL. Callbacks + are called by the stream parser (the callbacks are listed below). + + :: + + void strp_pause(struct strparser *strp) + + Temporarily pause a stream parser. Message parsing is suspended + and no new messages are delivered to the upper layer. + + :: + + void strp_unpause(struct strparser *strp) + + Unpause a paused stream parser. + + :: + + void strp_stop(struct strparser *strp); + + strp_stop is called to completely stop stream parser operations. + This is called internally when the stream parser encounters an + error, and it is called from the upper layer to stop parsing + operations. + + :: + + void strp_done(struct strparser *strp); + + strp_done is called to release any resources held by the stream + parser instance. This must be called after the stream processor + has been stopped. + + :: + + int strp_process(struct strparser *strp, struct sk_buff *orig_skb, + unsigned int orig_offset, size_t orig_len, + size_t max_msg_size, long timeo) + + strp_process is called in general mode for a stream parser to + parse an sk_buff. The number of bytes processed or a negative + error number is returned. Note that strp_process does not + consume the sk_buff. max_msg_size is maximum size the stream + parser will parse. timeo is timeout for completing a message. + + :: + + void strp_data_ready(struct strparser *strp); + + The upper layer calls strp_tcp_data_ready when data is ready on + the lower socket for strparser to process. This should be called + from a data_ready callback that is set on the socket. Note that + maximum messages size is the limit of the receive socket + buffer and message timeout is the receive timeout for the socket. + + :: + + void strp_check_rcv(struct strparser *strp); + + strp_check_rcv is called to check for new messages on the socket. + This is normally called at initialization of a stream parser + instance or after strp_unpause. + +Callbacks +========= + +There are six callbacks: + + :: + + int (*parse_msg)(struct strparser *strp, struct sk_buff *skb); + + parse_msg is called to determine the length of the next message + in the stream. The upper layer must implement this function. It + should parse the sk_buff as containing the headers for the + next application layer message in the stream. + + The skb->cb in the input skb is a struct strp_msg. Only + the offset field is relevant in parse_msg and gives the offset + where the message starts in the skb. + + The return values of this function are: + + ========= =========================================================== + >0 indicates length of successfully parsed message + 0 indicates more data must be received to parse the message + -ESTRPIPE current message should not be processed by the + kernel, return control of the socket to userspace which + can proceed to read the messages itself + other < 0 Error in parsing, give control back to userspace + assuming that synchronization is lost and the stream + is unrecoverable (application expected to close TCP socket) + ========= =========================================================== + + In the case that an error is returned (return value is less than + zero) and the parser is in receive callback mode, then it will set + the error on TCP socket and wake it up. If parse_msg returned + -ESTRPIPE and the stream parser had previously read some bytes for + the current message, then the error set on the attached socket is + ENODATA since the stream is unrecoverable in that case. + + :: + + void (*lock)(struct strparser *strp) + + The lock callback is called to lock the strp structure when + the strparser is performing an asynchronous operation (such as + processing a timeout). In receive callback mode the default + function is to lock_sock for the associated socket. In general + mode the callback must be set appropriately. + + :: + + void (*unlock)(struct strparser *strp) + + The unlock callback is called to release the lock obtained + by the lock callback. In receive callback mode the default + function is release_sock for the associated socket. In general + mode the callback must be set appropriately. + + :: + + void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb); + + rcv_msg is called when a full message has been received and + is queued. The callee must consume the sk_buff; it can + call strp_pause to prevent any further messages from being + received in rcv_msg (see strp_pause above). This callback + must be set. + + The skb->cb in the input skb is a struct strp_msg. This + struct contains two fields: offset and full_len. Offset is + where the message starts in the skb, and full_len is the + the length of the message. skb->len - offset may be greater + then full_len since strparser does not trim the skb. + + :: + + int (*read_sock_done)(struct strparser *strp, int err); + + read_sock_done is called when the stream parser is done reading + the TCP socket in receive callback mode. The stream parser may + read multiple messages in a loop and this function allows cleanup + to occur when exiting the loop. If the callback is not set (NULL + in strp_init) a default function is used. + + :: + + void (*abort_parser)(struct strparser *strp, int err); + + This function is called when stream parser encounters an error + in parsing. The default function stops the stream parser and + sets the error in the socket if the parser is in receive callback + mode. The default function can be changed by setting the callback + to non-NULL in strp_init. + +Statistics +========== + +Various counters are kept for each stream parser instance. These are in +the strp_stats structure. strp_aggr_stats is a convenience structure for +accumulating statistics for multiple stream parser instances. +save_strp_stats and aggregate_strp_stats are helper functions to save +and aggregate statistics. + +Message assembly limits +======================= + +The stream parser provide mechanisms to limit the resources consumed by +message assembly. + +A timer is set when assembly starts for a new message. In receive +callback mode the message timeout is taken from rcvtime for the +associated TCP socket. In general mode, the timeout is passed as an +argument in strp_process. If the timer fires before assembly completes +the stream parser is aborted and the ETIMEDOUT error is set on the TCP +socket if in receive callback mode. + +In receive callback mode, message length is limited to the receive +buffer size of the associated TCP socket. If the length returned by +parse_msg is greater than the socket buffer size then the stream parser +is aborted with EMSGSIZE error set on the TCP socket. Note that this +makes the maximum size of receive skbuffs for a socket with a stream +parser to be 2*sk_rcvbuf of the TCP socket. + +In general mode the message length limit is passed in as an argument +to strp_process. + +Author +====== + +Tom Herbert (tom@quantonium.net) diff --git a/Documentation/networking/strparser.txt b/Documentation/networking/strparser.txt deleted file mode 100644 index a7d354ddda7b..000000000000 --- a/Documentation/networking/strparser.txt +++ /dev/null @@ -1,207 +0,0 @@ -Stream Parser (strparser) - -Introduction -============ - -The stream parser (strparser) is a utility that parses messages of an -application layer protocol running over a data stream. The stream -parser works in conjunction with an upper layer in the kernel to provide -kernel support for application layer messages. For instance, Kernel -Connection Multiplexor (KCM) uses the Stream Parser to parse messages -using a BPF program. - -The strparser works in one of two modes: receive callback or general -mode. - -In receive callback mode, the strparser is called from the data_ready -callback of a TCP socket. Messages are parsed and delivered as they are -received on the socket. - -In general mode, a sequence of skbs are fed to strparser from an -outside source. Message are parsed and delivered as the sequence is -processed. This modes allows strparser to be applied to arbitrary -streams of data. - -Interface -========= - -The API includes a context structure, a set of callbacks, utility -functions, and a data_ready function for receive callback mode. The -callbacks include a parse_msg function that is called to perform -parsing (e.g. BPF parsing in case of KCM), and a rcv_msg function -that is called when a full message has been completed. - -Functions -========= - -strp_init(struct strparser *strp, struct sock *sk, - const struct strp_callbacks *cb) - - Called to initialize a stream parser. strp is a struct of type - strparser that is allocated by the upper layer. sk is the TCP - socket associated with the stream parser for use with receive - callback mode; in general mode this is set to NULL. Callbacks - are called by the stream parser (the callbacks are listed below). - -void strp_pause(struct strparser *strp) - - Temporarily pause a stream parser. Message parsing is suspended - and no new messages are delivered to the upper layer. - -void strp_unpause(struct strparser *strp) - - Unpause a paused stream parser. - -void strp_stop(struct strparser *strp); - - strp_stop is called to completely stop stream parser operations. - This is called internally when the stream parser encounters an - error, and it is called from the upper layer to stop parsing - operations. - -void strp_done(struct strparser *strp); - - strp_done is called to release any resources held by the stream - parser instance. This must be called after the stream processor - has been stopped. - -int strp_process(struct strparser *strp, struct sk_buff *orig_skb, - unsigned int orig_offset, size_t orig_len, - size_t max_msg_size, long timeo) - - strp_process is called in general mode for a stream parser to - parse an sk_buff. The number of bytes processed or a negative - error number is returned. Note that strp_process does not - consume the sk_buff. max_msg_size is maximum size the stream - parser will parse. timeo is timeout for completing a message. - -void strp_data_ready(struct strparser *strp); - - The upper layer calls strp_tcp_data_ready when data is ready on - the lower socket for strparser to process. This should be called - from a data_ready callback that is set on the socket. Note that - maximum messages size is the limit of the receive socket - buffer and message timeout is the receive timeout for the socket. - -void strp_check_rcv(struct strparser *strp); - - strp_check_rcv is called to check for new messages on the socket. - This is normally called at initialization of a stream parser - instance or after strp_unpause. - -Callbacks -========= - -There are six callbacks: - -int (*parse_msg)(struct strparser *strp, struct sk_buff *skb); - - parse_msg is called to determine the length of the next message - in the stream. The upper layer must implement this function. It - should parse the sk_buff as containing the headers for the - next application layer message in the stream. - - The skb->cb in the input skb is a struct strp_msg. Only - the offset field is relevant in parse_msg and gives the offset - where the message starts in the skb. - - The return values of this function are: - - >0 : indicates length of successfully parsed message - 0 : indicates more data must be received to parse the message - -ESTRPIPE : current message should not be processed by the - kernel, return control of the socket to userspace which - can proceed to read the messages itself - other < 0 : Error in parsing, give control back to userspace - assuming that synchronization is lost and the stream - is unrecoverable (application expected to close TCP socket) - - In the case that an error is returned (return value is less than - zero) and the parser is in receive callback mode, then it will set - the error on TCP socket and wake it up. If parse_msg returned - -ESTRPIPE and the stream parser had previously read some bytes for - the current message, then the error set on the attached socket is - ENODATA since the stream is unrecoverable in that case. - -void (*lock)(struct strparser *strp) - - The lock callback is called to lock the strp structure when - the strparser is performing an asynchronous operation (such as - processing a timeout). In receive callback mode the default - function is to lock_sock for the associated socket. In general - mode the callback must be set appropriately. - -void (*unlock)(struct strparser *strp) - - The unlock callback is called to release the lock obtained - by the lock callback. In receive callback mode the default - function is release_sock for the associated socket. In general - mode the callback must be set appropriately. - -void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb); - - rcv_msg is called when a full message has been received and - is queued. The callee must consume the sk_buff; it can - call strp_pause to prevent any further messages from being - received in rcv_msg (see strp_pause above). This callback - must be set. - - The skb->cb in the input skb is a struct strp_msg. This - struct contains two fields: offset and full_len. Offset is - where the message starts in the skb, and full_len is the - the length of the message. skb->len - offset may be greater - then full_len since strparser does not trim the skb. - -int (*read_sock_done)(struct strparser *strp, int err); - - read_sock_done is called when the stream parser is done reading - the TCP socket in receive callback mode. The stream parser may - read multiple messages in a loop and this function allows cleanup - to occur when exiting the loop. If the callback is not set (NULL - in strp_init) a default function is used. - -void (*abort_parser)(struct strparser *strp, int err); - - This function is called when stream parser encounters an error - in parsing. The default function stops the stream parser and - sets the error in the socket if the parser is in receive callback - mode. The default function can be changed by setting the callback - to non-NULL in strp_init. - -Statistics -========== - -Various counters are kept for each stream parser instance. These are in -the strp_stats structure. strp_aggr_stats is a convenience structure for -accumulating statistics for multiple stream parser instances. -save_strp_stats and aggregate_strp_stats are helper functions to save -and aggregate statistics. - -Message assembly limits -======================= - -The stream parser provide mechanisms to limit the resources consumed by -message assembly. - -A timer is set when assembly starts for a new message. In receive -callback mode the message timeout is taken from rcvtime for the -associated TCP socket. In general mode, the timeout is passed as an -argument in strp_process. If the timer fires before assembly completes -the stream parser is aborted and the ETIMEDOUT error is set on the TCP -socket if in receive callback mode. - -In receive callback mode, message length is limited to the receive -buffer size of the associated TCP socket. If the length returned by -parse_msg is greater than the socket buffer size then the stream parser -is aborted with EMSGSIZE error set on the TCP socket. Note that this -makes the maximum size of receive skbuffs for a socket with a stream -parser to be 2*sk_rcvbuf of the TCP socket. - -In general mode the message length limit is passed in as an argument -to strp_process. - -Author -====== - -Tom Herbert (tom@quantonium.net) - -- cgit v1.2.3 From 32c0f0bed5bb08625083ed7f5b661c842d63ebd1 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:27 +0200 Subject: docs: networking: convert switchdev.txt to ReST - add SPDX header; - use copyright symbol; - adjust title markup; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/switchdev.rst | 387 +++++++++++++++++++++++++++++++++ Documentation/networking/switchdev.txt | 373 ------------------------------- drivers/staging/fsl-dpaa2/ethsw/README | 2 +- 4 files changed, 389 insertions(+), 374 deletions(-) create mode 100644 Documentation/networking/switchdev.rst delete mode 100644 Documentation/networking/switchdev.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e5a705024c6a..5e495804f96f 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -105,6 +105,7 @@ Contents: seg6-sysctl skfp strparser + switchdev .. only:: subproject and html diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst new file mode 100644 index 000000000000..ddc3f35775dc --- /dev/null +++ b/Documentation/networking/switchdev.rst @@ -0,0 +1,387 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +=============================================== +Ethernet switch device driver model (switchdev) +=============================================== + +Copyright |copy| 2014 Jiri Pirko + +Copyright |copy| 2014-2015 Scott Feldman + + +The Ethernet switch device driver model (switchdev) is an in-kernel driver +model for switch devices which offload the forwarding (data) plane from the +kernel. + +Figure 1 is a block diagram showing the components of the switchdev model for +an example setup using a data-center-class switch ASIC chip. Other setups +with SR-IOV or soft switches, such as OVS, are possible. + +:: + + + User-space tools + + user space | + +-------------------------------------------------------------------+ + kernel | Netlink + | + +--------------+-------------------------------+ + | Network stack | + | (Linux) | + | | + +----------------------------------------------+ + + sw1p2 sw1p4 sw1p6 + sw1p1 + sw1p3 + sw1p5 + eth1 + + | + | + | + + | | | | | | | + +--+----+----+----+----+----+---+ +-----+-----+ + | Switch driver | | mgmt | + | (this document) | | driver | + | | | | + +--------------+----------------+ +-----------+ + | + kernel | HW bus (eg PCI) + +-------------------------------------------------------------------+ + hardware | + +--------------+----------------+ + | Switch device (sw1) | + | +----+ +--------+ + | | v offloaded data path | mgmt port + | | | | + +--|----|----+----+----+----+---+ + | | | | | | + + + + + + + + p1 p2 p3 p4 p5 p6 + + front-panel ports + + + Fig 1. + + +Include Files +------------- + +:: + + #include + #include + + +Configuration +------------- + +Use "depends NET_SWITCHDEV" in driver's Kconfig to ensure switchdev model +support is built for driver. + + +Switch Ports +------------ + +On switchdev driver initialization, the driver will allocate and register a +struct net_device (using register_netdev()) for each enumerated physical switch +port, called the port netdev. A port netdev is the software representation of +the physical port and provides a conduit for control traffic to/from the +controller (the kernel) and the network, as well as an anchor point for higher +level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using +standard netdev tools (iproute2, ethtool, etc), the port netdev can also +provide to the user access to the physical properties of the switch port such +as PHY link state and I/O statistics. + +There is (currently) no higher-level kernel object for the switch beyond the +port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops. + +A switch management port is outside the scope of the switchdev driver model. +Typically, the management port is not participating in offloaded data plane and +is loaded with a different driver, such as a NIC driver, on the management port +device. + +Switch ID +^^^^^^^^^ + +The switchdev driver must implement the net_device operation +ndo_get_port_parent_id for each port netdev, returning the same physical ID for +each port of a switch. The ID must be unique between switches on the same +system. The ID does not need to be unique between switches on different +systems. + +The switch ID is used to locate ports on a switch and to know if aggregated +ports belong to the same switch. + +Port Netdev Naming +^^^^^^^^^^^^^^^^^^ + +Udev rules should be used for port netdev naming, using some unique attribute +of the port as a key, for example the port MAC address or the port PHYS name. +Hard-coding of kernel netdev names within the driver is discouraged; let the +kernel pick the default netdev name, and let udev set the final name based on a +port attribute. + +Using port PHYS name (ndo_get_phys_port_name) for the key is particularly +useful for dynamically-named ports where the device names its ports based on +external configuration. For example, if a physical 40G port is split logically +into 4 10G ports, resulting in 4 port netdevs, the device can give a unique +name for each port using port PHYS name. The udev rule would be:: + + SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="", \ + ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}" + +Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y +is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 +would be sub-port 0 on port 1 on switch 1. + +Port Features +^^^^^^^^^^^^^ + +NETIF_F_NETNS_LOCAL + +If the switchdev driver (and device) only supports offloading of the default +network namespace (netns), the driver should set this feature flag to prevent +the port netdev from being moved out of the default netns. A netns-aware +driver/device would not set this flag and be responsible for partitioning +hardware to preserve netns containment. This means hardware cannot forward +traffic from a port in one namespace to another port in another namespace. + +Port Topology +^^^^^^^^^^^^^ + +The port netdevs representing the physical switch ports can be organized into +higher-level switching constructs. The default construct is a standalone +router port, used to offload L3 forwarding. Two or more ports can be bonded +together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge +L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3 +tunnels can be built on ports. These constructs are built using standard Linux +tools such as the bridge driver, the bonding/team drivers, and netlink-based +tools such as iproute2. + +The switchdev driver can know a particular port's position in the topology by +monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a +bond will see it's upper master change. If that bond is moved into a bridge, +the bond's upper master will change. And so on. The driver will track such +movements to know what position a port is in in the overall topology by +registering for netdevice events and acting on NETDEV_CHANGEUPPER. + +L2 Forwarding Offload +--------------------- + +The idea is to offload the L2 data forwarding (switching) path from the kernel +to the switchdev device by mirroring bridge FDB entries down to the device. An +FDB entry is the {port, MAC, VLAN} tuple forwarding destination. + +To offloading L2 bridging, the switchdev driver/device should support: + + - Static FDB entries installed on a bridge port + - Notification of learned/forgotten src mac/vlans from device + - STP state changes on the port + - VLAN flooding of multicast/broadcast and unknown unicast packets + +Static FDB Entries +^^^^^^^^^^^^^^^^^^ + +The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump +to support static FDB entries installed to the device. Static bridge FDB +entries are installed, for example, using iproute2 bridge cmd:: + + bridge fdb add ADDR dev DEV [vlan VID] [self] + +The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx +ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using +switchdev_port_obj_xxx ops. + +XXX: what should be done if offloading this rule to hardware fails (for +example, due to full capacity in hardware tables) ? + +Note: by default, the bridge does not filter on VLAN and only bridges untagged +traffic. To enable VLAN support, turn on VLAN filtering:: + + echo 1 >/sys/class/net//bridge/vlan_filtering + +Notification of Learned/Forgotten Source MAC/VLANs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The switch device will learn/forget source MAC address/VLAN on ingress packets +and notify the switch driver of the mac/vlan/port tuples. The switch driver, +in turn, will notify the bridge driver using the switchdev notifier call:: + + err = call_switchdev_notifiers(val, dev, info, extack); + +Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when +forgetting, and info points to a struct switchdev_notifier_fdb_info. On +SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the +bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge +command will label these entries "offload":: + + $ bridge fdb + 52:54:00:12:35:01 dev sw1p1 master br0 permanent + 00:02:00:00:02:00 dev sw1p1 master br0 offload + 00:02:00:00:02:00 dev sw1p1 self + 52:54:00:12:35:02 dev sw1p2 master br0 permanent + 00:02:00:00:03:00 dev sw1p2 master br0 offload + 00:02:00:00:03:00 dev sw1p2 self + 33:33:00:00:00:01 dev eth0 self permanent + 01:00:5e:00:00:01 dev eth0 self permanent + 33:33:ff:00:00:00 dev eth0 self permanent + 01:80:c2:00:00:0e dev eth0 self permanent + 33:33:00:00:00:01 dev br0 self permanent + 01:00:5e:00:00:01 dev br0 self permanent + 33:33:ff:12:35:01 dev br0 self permanent + +Learning on the port should be disabled on the bridge using the bridge command:: + + bridge link set dev DEV learning off + +Learning on the device port should be enabled, as well as learning_sync:: + + bridge link set dev DEV learning on self + bridge link set dev DEV learning_sync on self + +Learning_sync attribute enables syncing of the learned/forgotten FDB entry to +the bridge's FDB. It's possible, but not optimal, to enable learning on the +device port and on the bridge port, and disable learning_sync. + +To support learning, the driver implements switchdev op +switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_ID_{PRE}_BRIDGE_FLAGS. + +FDB Ageing +^^^^^^^^^^ + +The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is +the responsibility of the port driver/device to age out these entries. If the +port device supports ageing, when the FDB entry expires, it will notify the +driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the +device does not support ageing, the driver can simulate ageing using a +garbage collection timer to monitor FDB entries. Expired entries will be +notified to the bridge using SWITCHDEV_FDB_DEL. See rocker driver for +example of driver running ageing timer. + +To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB +entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The +notification will reset the FDB entry's last-used time to now. The driver +should rate limit refresh notifications, for example, no more than once a +second. (The last-used time is visible using the bridge -s fdb option). + +STP State Change on Port +^^^^^^^^^^^^^^^^^^^^^^^^ + +Internally or with a third-party STP protocol implementation (e.g. mstpd), the +bridge driver maintains the STP state for ports, and will notify the switch +driver of STP state change on a port using the switchdev op +switchdev_attr_port_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE. + +State is one of BR_STATE_*. The switch driver can use STP state updates to +update ingress packet filter list for the port. For example, if port is +DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs +and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass. + +Note that STP BDPUs are untagged and STP state applies to all VLANs on the port +so packet filters should be applied consistently across untagged and tagged +VLANs on the port. + +Flooding L2 domain +^^^^^^^^^^^^^^^^^^ + +For a given L2 VLAN domain, the switch device should flood multicast/broadcast +and unknown unicast packets to all ports in domain, if allowed by port's +current STP state. The switch driver, knowing which ports are within which +vlan L2 domain, can program the switch device for flooding. The packet may +be sent to the port netdev for processing by the bridge driver. The +bridge should not reflood the packet to the same ports the device flooded, +otherwise there will be duplicate packets on the wire. + +To avoid duplicate packets, the switch driver should mark a packet as already +forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark +the skb using the ingress bridge port's mark and prevent it from being forwarded +through any bridge port with the same mark. + +It is possible for the switch device to not handle flooding and push the +packets up to the bridge driver for flooding. This is not ideal as the number +of ports scale in the L2 domain as the device is much more efficient at +flooding packets that software. + +If supported by the device, flood control can be offloaded to it, preventing +certain netdevs from flooding unicast traffic for which there is no FDB entry. + +IGMP Snooping +^^^^^^^^^^^^^ + +In order to support IGMP snooping, the port netdevs should trap to the bridge +driver all IGMP join and leave messages. +The bridge multicast module will notify port netdevs on every multicast group +changed whether it is static configured or dynamically joined/leave. +The hardware implementation should be forwarding all registered multicast +traffic groups only to the configured ports. + +L3 Routing Offload +------------------ + +Offloading L3 routing requires that device be programmed with FIB entries from +the kernel, with the device doing the FIB lookup and forwarding. The device +does a longest prefix match (LPM) on FIB entries matching route prefix and +forwards the packet to the matching FIB entry's nexthop(s) egress ports. + +To program the device, the driver has to register a FIB notifier handler +using register_fib_notifier. The following events are available: + +=================== =================================================== +FIB_EVENT_ENTRY_ADD used for both adding a new FIB entry to the device, + or modifying an existing entry on the device. +FIB_EVENT_ENTRY_DEL used for removing a FIB entry +FIB_EVENT_RULE_ADD, +FIB_EVENT_RULE_DEL used to propagate FIB rule changes +=================== =================================================== + +FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:: + + struct fib_entry_notifier_info { + struct fib_notifier_info info; /* must be first */ + u32 dst; + int dst_len; + struct fib_info *fi; + u8 tos; + u8 type; + u32 tb_id; + u32 nlflags; + }; + +to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The ``*fi`` +structure holds details on the route and route's nexthops. ``*dev`` is one +of the port netdevs mentioned in the route's next hop list. + +Routes offloaded to the device are labeled with "offload" in the ip route +listing:: + + $ ip route show + default via 192.168.0.2 dev eth0 + 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload + 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload + 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload + 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload + 12.0.0.2 proto zebra metric 30 offload + nexthop via 11.0.0.1 dev sw1p1 weight 1 + nexthop via 11.0.0.9 dev sw1p2 weight 1 + 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload + 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload + 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15 + +The "offload" flag is set in case at least one device offloads the FIB entry. + +XXX: add/mod/del IPv6 FIB API + +Nexthop Resolution +^^^^^^^^^^^^^^^^^^ + +The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for +the switch device to forward the packet with the correct dst mac address, the +nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac +address discovery comes via the ARP (or ND) process and is available via the +arp_tbl neighbor table. To resolve the routes nexthop gateways, the driver +should trigger the kernel's neighbor resolution process. See the rocker +driver's rocker_port_ipv4_resolve() for an example. + +The driver can monitor for updates to arp_tbl using the netevent notifier +NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops +for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy +to know when arp_tbl neighbor entries are purged from the port. diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt deleted file mode 100644 index 86174ce8cd13..000000000000 --- a/Documentation/networking/switchdev.txt +++ /dev/null @@ -1,373 +0,0 @@ -Ethernet switch device driver model (switchdev) -=============================================== -Copyright (c) 2014 Jiri Pirko -Copyright (c) 2014-2015 Scott Feldman - - -The Ethernet switch device driver model (switchdev) is an in-kernel driver -model for switch devices which offload the forwarding (data) plane from the -kernel. - -Figure 1 is a block diagram showing the components of the switchdev model for -an example setup using a data-center-class switch ASIC chip. Other setups -with SR-IOV or soft switches, such as OVS, are possible. - - - User-space tools - - user space | - +-------------------------------------------------------------------+ - kernel | Netlink - | - +--------------+-------------------------------+ - | Network stack | - | (Linux) | - | | - +----------------------------------------------+ - - sw1p2 sw1p4 sw1p6 - sw1p1 + sw1p3 + sw1p5 + eth1 - + | + | + | + - | | | | | | | - +--+----+----+----+----+----+---+ +-----+-----+ - | Switch driver | | mgmt | - | (this document) | | driver | - | | | | - +--------------+----------------+ +-----------+ - | - kernel | HW bus (eg PCI) - +-------------------------------------------------------------------+ - hardware | - +--------------+----------------+ - | Switch device (sw1) | - | +----+ +--------+ - | | v offloaded data path | mgmt port - | | | | - +--|----|----+----+----+----+---+ - | | | | | | - + + + + + + - p1 p2 p3 p4 p5 p6 - - front-panel ports - - - Fig 1. - - -Include Files -------------- - -#include -#include - - -Configuration -------------- - -Use "depends NET_SWITCHDEV" in driver's Kconfig to ensure switchdev model -support is built for driver. - - -Switch Ports ------------- - -On switchdev driver initialization, the driver will allocate and register a -struct net_device (using register_netdev()) for each enumerated physical switch -port, called the port netdev. A port netdev is the software representation of -the physical port and provides a conduit for control traffic to/from the -controller (the kernel) and the network, as well as an anchor point for higher -level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using -standard netdev tools (iproute2, ethtool, etc), the port netdev can also -provide to the user access to the physical properties of the switch port such -as PHY link state and I/O statistics. - -There is (currently) no higher-level kernel object for the switch beyond the -port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops. - -A switch management port is outside the scope of the switchdev driver model. -Typically, the management port is not participating in offloaded data plane and -is loaded with a different driver, such as a NIC driver, on the management port -device. - -Switch ID -^^^^^^^^^ - -The switchdev driver must implement the net_device operation -ndo_get_port_parent_id for each port netdev, returning the same physical ID for -each port of a switch. The ID must be unique between switches on the same -system. The ID does not need to be unique between switches on different -systems. - -The switch ID is used to locate ports on a switch and to know if aggregated -ports belong to the same switch. - -Port Netdev Naming -^^^^^^^^^^^^^^^^^^ - -Udev rules should be used for port netdev naming, using some unique attribute -of the port as a key, for example the port MAC address or the port PHYS name. -Hard-coding of kernel netdev names within the driver is discouraged; let the -kernel pick the default netdev name, and let udev set the final name based on a -port attribute. - -Using port PHYS name (ndo_get_phys_port_name) for the key is particularly -useful for dynamically-named ports where the device names its ports based on -external configuration. For example, if a physical 40G port is split logically -into 4 10G ports, resulting in 4 port netdevs, the device can give a unique -name for each port using port PHYS name. The udev rule would be: - -SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="", \ - ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}" - -Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y -is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 -would be sub-port 0 on port 1 on switch 1. - -Port Features -^^^^^^^^^^^^^ - -NETIF_F_NETNS_LOCAL - -If the switchdev driver (and device) only supports offloading of the default -network namespace (netns), the driver should set this feature flag to prevent -the port netdev from being moved out of the default netns. A netns-aware -driver/device would not set this flag and be responsible for partitioning -hardware to preserve netns containment. This means hardware cannot forward -traffic from a port in one namespace to another port in another namespace. - -Port Topology -^^^^^^^^^^^^^ - -The port netdevs representing the physical switch ports can be organized into -higher-level switching constructs. The default construct is a standalone -router port, used to offload L3 forwarding. Two or more ports can be bonded -together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge -L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3 -tunnels can be built on ports. These constructs are built using standard Linux -tools such as the bridge driver, the bonding/team drivers, and netlink-based -tools such as iproute2. - -The switchdev driver can know a particular port's position in the topology by -monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a -bond will see it's upper master change. If that bond is moved into a bridge, -the bond's upper master will change. And so on. The driver will track such -movements to know what position a port is in in the overall topology by -registering for netdevice events and acting on NETDEV_CHANGEUPPER. - -L2 Forwarding Offload ---------------------- - -The idea is to offload the L2 data forwarding (switching) path from the kernel -to the switchdev device by mirroring bridge FDB entries down to the device. An -FDB entry is the {port, MAC, VLAN} tuple forwarding destination. - -To offloading L2 bridging, the switchdev driver/device should support: - - - Static FDB entries installed on a bridge port - - Notification of learned/forgotten src mac/vlans from device - - STP state changes on the port - - VLAN flooding of multicast/broadcast and unknown unicast packets - -Static FDB Entries -^^^^^^^^^^^^^^^^^^ - -The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump -to support static FDB entries installed to the device. Static bridge FDB -entries are installed, for example, using iproute2 bridge cmd: - - bridge fdb add ADDR dev DEV [vlan VID] [self] - -The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx -ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using -switchdev_port_obj_xxx ops. - -XXX: what should be done if offloading this rule to hardware fails (for -example, due to full capacity in hardware tables) ? - -Note: by default, the bridge does not filter on VLAN and only bridges untagged -traffic. To enable VLAN support, turn on VLAN filtering: - - echo 1 >/sys/class/net//bridge/vlan_filtering - -Notification of Learned/Forgotten Source MAC/VLANs -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The switch device will learn/forget source MAC address/VLAN on ingress packets -and notify the switch driver of the mac/vlan/port tuples. The switch driver, -in turn, will notify the bridge driver using the switchdev notifier call: - - err = call_switchdev_notifiers(val, dev, info, extack); - -Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when -forgetting, and info points to a struct switchdev_notifier_fdb_info. On -SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the -bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge -command will label these entries "offload": - - $ bridge fdb - 52:54:00:12:35:01 dev sw1p1 master br0 permanent - 00:02:00:00:02:00 dev sw1p1 master br0 offload - 00:02:00:00:02:00 dev sw1p1 self - 52:54:00:12:35:02 dev sw1p2 master br0 permanent - 00:02:00:00:03:00 dev sw1p2 master br0 offload - 00:02:00:00:03:00 dev sw1p2 self - 33:33:00:00:00:01 dev eth0 self permanent - 01:00:5e:00:00:01 dev eth0 self permanent - 33:33:ff:00:00:00 dev eth0 self permanent - 01:80:c2:00:00:0e dev eth0 self permanent - 33:33:00:00:00:01 dev br0 self permanent - 01:00:5e:00:00:01 dev br0 self permanent - 33:33:ff:12:35:01 dev br0 self permanent - -Learning on the port should be disabled on the bridge using the bridge command: - - bridge link set dev DEV learning off - -Learning on the device port should be enabled, as well as learning_sync: - - bridge link set dev DEV learning on self - bridge link set dev DEV learning_sync on self - -Learning_sync attribute enables syncing of the learned/forgotten FDB entry to -the bridge's FDB. It's possible, but not optimal, to enable learning on the -device port and on the bridge port, and disable learning_sync. - -To support learning, the driver implements switchdev op -switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_ID_{PRE}_BRIDGE_FLAGS. - -FDB Ageing -^^^^^^^^^^ - -The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is -the responsibility of the port driver/device to age out these entries. If the -port device supports ageing, when the FDB entry expires, it will notify the -driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the -device does not support ageing, the driver can simulate ageing using a -garbage collection timer to monitor FDB entries. Expired entries will be -notified to the bridge using SWITCHDEV_FDB_DEL. See rocker driver for -example of driver running ageing timer. - -To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB -entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The -notification will reset the FDB entry's last-used time to now. The driver -should rate limit refresh notifications, for example, no more than once a -second. (The last-used time is visible using the bridge -s fdb option). - -STP State Change on Port -^^^^^^^^^^^^^^^^^^^^^^^^ - -Internally or with a third-party STP protocol implementation (e.g. mstpd), the -bridge driver maintains the STP state for ports, and will notify the switch -driver of STP state change on a port using the switchdev op -switchdev_attr_port_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE. - -State is one of BR_STATE_*. The switch driver can use STP state updates to -update ingress packet filter list for the port. For example, if port is -DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs -and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass. - -Note that STP BDPUs are untagged and STP state applies to all VLANs on the port -so packet filters should be applied consistently across untagged and tagged -VLANs on the port. - -Flooding L2 domain -^^^^^^^^^^^^^^^^^^ - -For a given L2 VLAN domain, the switch device should flood multicast/broadcast -and unknown unicast packets to all ports in domain, if allowed by port's -current STP state. The switch driver, knowing which ports are within which -vlan L2 domain, can program the switch device for flooding. The packet may -be sent to the port netdev for processing by the bridge driver. The -bridge should not reflood the packet to the same ports the device flooded, -otherwise there will be duplicate packets on the wire. - -To avoid duplicate packets, the switch driver should mark a packet as already -forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark -the skb using the ingress bridge port's mark and prevent it from being forwarded -through any bridge port with the same mark. - -It is possible for the switch device to not handle flooding and push the -packets up to the bridge driver for flooding. This is not ideal as the number -of ports scale in the L2 domain as the device is much more efficient at -flooding packets that software. - -If supported by the device, flood control can be offloaded to it, preventing -certain netdevs from flooding unicast traffic for which there is no FDB entry. - -IGMP Snooping -^^^^^^^^^^^^^ - -In order to support IGMP snooping, the port netdevs should trap to the bridge -driver all IGMP join and leave messages. -The bridge multicast module will notify port netdevs on every multicast group -changed whether it is static configured or dynamically joined/leave. -The hardware implementation should be forwarding all registered multicast -traffic groups only to the configured ports. - -L3 Routing Offload ------------------- - -Offloading L3 routing requires that device be programmed with FIB entries from -the kernel, with the device doing the FIB lookup and forwarding. The device -does a longest prefix match (LPM) on FIB entries matching route prefix and -forwards the packet to the matching FIB entry's nexthop(s) egress ports. - -To program the device, the driver has to register a FIB notifier handler -using register_fib_notifier. The following events are available: -FIB_EVENT_ENTRY_ADD: used for both adding a new FIB entry to the device, - or modifying an existing entry on the device. -FIB_EVENT_ENTRY_DEL: used for removing a FIB entry -FIB_EVENT_RULE_ADD, FIB_EVENT_RULE_DEL: used to propagate FIB rule changes - -FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass: - - struct fib_entry_notifier_info { - struct fib_notifier_info info; /* must be first */ - u32 dst; - int dst_len; - struct fib_info *fi; - u8 tos; - u8 type; - u32 tb_id; - u32 nlflags; - }; - -to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The *fi -structure holds details on the route and route's nexthops. *dev is one of the -port netdevs mentioned in the route's next hop list. - -Routes offloaded to the device are labeled with "offload" in the ip route -listing: - - $ ip route show - default via 192.168.0.2 dev eth0 - 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload - 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload - 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload - 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload - 12.0.0.2 proto zebra metric 30 offload - nexthop via 11.0.0.1 dev sw1p1 weight 1 - nexthop via 11.0.0.9 dev sw1p2 weight 1 - 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload - 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload - 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15 - -The "offload" flag is set in case at least one device offloads the FIB entry. - -XXX: add/mod/del IPv6 FIB API - -Nexthop Resolution -^^^^^^^^^^^^^^^^^^ - -The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for -the switch device to forward the packet with the correct dst mac address, the -nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac -address discovery comes via the ARP (or ND) process and is available via the -arp_tbl neighbor table. To resolve the routes nexthop gateways, the driver -should trigger the kernel's neighbor resolution process. See the rocker -driver's rocker_port_ipv4_resolve() for an example. - -The driver can monitor for updates to arp_tbl using the netevent notifier -NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops -for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy -to know when arp_tbl neighbor entries are purged from the port. diff --git a/drivers/staging/fsl-dpaa2/ethsw/README b/drivers/staging/fsl-dpaa2/ethsw/README index f6fc07f780d1..b48dcbf7c5fb 100644 --- a/drivers/staging/fsl-dpaa2/ethsw/README +++ b/drivers/staging/fsl-dpaa2/ethsw/README @@ -79,7 +79,7 @@ The DPSW can have ports connected to DPNIs or to PHYs via DPMACs. For a more detailed description of the Ethernet switch device driver model see: - Documentation/networking/switchdev.txt + Documentation/networking/switchdev.rst Creating an Ethernet Switch =========================== -- cgit v1.2.3 From d2461edde7d15121a2cd39d48cd39edbd3fa019a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:28 +0200 Subject: docs: networking: convert tc-actions-env-rules.txt to ReST - add SPDX header; - add a document title; - use the right numbered list markup; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/tc-actions-env-rules.rst | 29 +++++++++++++++++++++++ Documentation/networking/tc-actions-env-rules.txt | 24 ------------------- 3 files changed, 30 insertions(+), 24 deletions(-) create mode 100644 Documentation/networking/tc-actions-env-rules.rst delete mode 100644 Documentation/networking/tc-actions-env-rules.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 5e495804f96f..f53d89b5679a 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -106,6 +106,7 @@ Contents: skfp strparser switchdev + tc-actions-env-rules .. only:: subproject and html diff --git a/Documentation/networking/tc-actions-env-rules.rst b/Documentation/networking/tc-actions-env-rules.rst new file mode 100644 index 000000000000..86884b8fb4e0 --- /dev/null +++ b/Documentation/networking/tc-actions-env-rules.rst @@ -0,0 +1,29 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +TC Actions - Environmental Rules +================================ + + +The "environmental" rules for authors of any new tc actions are: + +1) If you stealeth or borroweth any packet thou shalt be branching + from the righteous path and thou shalt cloneth. + + For example if your action queues a packet to be processed later, + or intentionally branches by redirecting a packet, then you need to + clone the packet. + +2) If you munge any packet thou shalt call pskb_expand_head in the case + someone else is referencing the skb. After that you "own" the skb. + +3) Dropping packets you don't own is a no-no. You simply return + TC_ACT_SHOT to the caller and they will drop it. + +The "environmental" rules for callers of actions (qdiscs etc) are: + +#) Thou art responsible for freeing anything returned as being + TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is + returned, then all is great and you don't need to do anything. + +Post on netdev if something is unclear. diff --git a/Documentation/networking/tc-actions-env-rules.txt b/Documentation/networking/tc-actions-env-rules.txt deleted file mode 100644 index f37814693ad3..000000000000 --- a/Documentation/networking/tc-actions-env-rules.txt +++ /dev/null @@ -1,24 +0,0 @@ - -The "environmental" rules for authors of any new tc actions are: - -1) If you stealeth or borroweth any packet thou shalt be branching -from the righteous path and thou shalt cloneth. - -For example if your action queues a packet to be processed later, -or intentionally branches by redirecting a packet, then you need to -clone the packet. - -2) If you munge any packet thou shalt call pskb_expand_head in the case -someone else is referencing the skb. After that you "own" the skb. - -3) Dropping packets you don't own is a no-no. You simply return -TC_ACT_SHOT to the caller and they will drop it. - -The "environmental" rules for callers of actions (qdiscs etc) are: - -*) Thou art responsible for freeing anything returned as being -TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is -returned, then all is great and you don't need to do anything. - -Post on netdev if something is unclear. - -- cgit v1.2.3 From ff159f4f1152b7012d378c69ffd253ed53227865 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:29 +0200 Subject: docs: networking: convert tcp-thin.txt to ReST Not much to be done here: - add SPDX header; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/ip-sysctl.rst | 2 +- Documentation/networking/tcp-thin.rst | 52 ++++++++++++++++++++++++++++++++++ Documentation/networking/tcp-thin.txt | 47 ------------------------------ 4 files changed, 54 insertions(+), 48 deletions(-) create mode 100644 Documentation/networking/tcp-thin.rst delete mode 100644 Documentation/networking/tcp-thin.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index f53d89b5679a..89b02fbfc2eb 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -107,6 +107,7 @@ Contents: strparser switchdev tc-actions-env-rules + tcp-thin .. only:: subproject and html diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 38f811d4b2f0..3266aee9e052 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -886,7 +886,7 @@ tcp_thin_linear_timeouts - BOOLEAN initiated. This improves retransmission latency for non-aggressive thin streams, often found to be time-dependent. For more information on thin streams, see - Documentation/networking/tcp-thin.txt + Documentation/networking/tcp-thin.rst Default: 0 diff --git a/Documentation/networking/tcp-thin.rst b/Documentation/networking/tcp-thin.rst new file mode 100644 index 000000000000..b06765c96ea1 --- /dev/null +++ b/Documentation/networking/tcp-thin.rst @@ -0,0 +1,52 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Thin-streams and TCP +==================== + +A wide range of Internet-based services that use reliable transport +protocols display what we call thin-stream properties. This means +that the application sends data with such a low rate that the +retransmission mechanisms of the transport protocol are not fully +effective. In time-dependent scenarios (like online games, control +systems, stock trading etc.) where the user experience depends +on the data delivery latency, packet loss can be devastating for +the service quality. Extreme latencies are caused by TCP's +dependency on the arrival of new data from the application to trigger +retransmissions effectively through fast retransmit instead of +waiting for long timeouts. + +After analysing a large number of time-dependent interactive +applications, we have seen that they often produce thin streams +and also stay with this traffic pattern throughout its entire +lifespan. The combination of time-dependency and the fact that the +streams provoke high latencies when using TCP is unfortunate. + +In order to reduce application-layer latency when packets are lost, +a set of mechanisms has been made, which address these latency issues +for thin streams. In short, if the kernel detects a thin stream, +the retransmission mechanisms are modified in the following manner: + +1) If the stream is thin, fast retransmit on the first dupACK. +2) If the stream is thin, do not apply exponential backoff. + +These enhancements are applied only if the stream is detected as +thin. This is accomplished by defining a threshold for the number +of packets in flight. If there are less than 4 packets in flight, +fast retransmissions can not be triggered, and the stream is prone +to experience high retransmission latencies. + +Since these mechanisms are targeted at time-dependent applications, +they must be specifically activated by the application using the +TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK IOCTLS or the +tcp_thin_linear_timeouts and tcp_thin_dupack sysctls. Both +modifications are turned off by default. + +References +========== +More information on the modifications, as well as a wide range of +experimental data can be found here: + +"Improving latency for interactive, thin-stream applications over +reliable transport" +http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file diff --git a/Documentation/networking/tcp-thin.txt b/Documentation/networking/tcp-thin.txt deleted file mode 100644 index 151e229980f1..000000000000 --- a/Documentation/networking/tcp-thin.txt +++ /dev/null @@ -1,47 +0,0 @@ -Thin-streams and TCP -==================== -A wide range of Internet-based services that use reliable transport -protocols display what we call thin-stream properties. This means -that the application sends data with such a low rate that the -retransmission mechanisms of the transport protocol are not fully -effective. In time-dependent scenarios (like online games, control -systems, stock trading etc.) where the user experience depends -on the data delivery latency, packet loss can be devastating for -the service quality. Extreme latencies are caused by TCP's -dependency on the arrival of new data from the application to trigger -retransmissions effectively through fast retransmit instead of -waiting for long timeouts. - -After analysing a large number of time-dependent interactive -applications, we have seen that they often produce thin streams -and also stay with this traffic pattern throughout its entire -lifespan. The combination of time-dependency and the fact that the -streams provoke high latencies when using TCP is unfortunate. - -In order to reduce application-layer latency when packets are lost, -a set of mechanisms has been made, which address these latency issues -for thin streams. In short, if the kernel detects a thin stream, -the retransmission mechanisms are modified in the following manner: - -1) If the stream is thin, fast retransmit on the first dupACK. -2) If the stream is thin, do not apply exponential backoff. - -These enhancements are applied only if the stream is detected as -thin. This is accomplished by defining a threshold for the number -of packets in flight. If there are less than 4 packets in flight, -fast retransmissions can not be triggered, and the stream is prone -to experience high retransmission latencies. - -Since these mechanisms are targeted at time-dependent applications, -they must be specifically activated by the application using the -TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK IOCTLS or the -tcp_thin_linear_timeouts and tcp_thin_dupack sysctls. Both -modifications are turned off by default. - -References -========== -More information on the modifications, as well as a wide range of -experimental data can be found here: -"Improving latency for interactive, thin-stream applications over -reliable transport" -http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file -- cgit v1.2.3 From aa8a6ee3e3fc4001e952de37660fe71826da8189 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:30 +0200 Subject: docs: networking: convert team.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/team.rst | 8 ++++++++ Documentation/networking/team.txt | 2 -- 3 files changed, 9 insertions(+), 2 deletions(-) create mode 100644 Documentation/networking/team.rst delete mode 100644 Documentation/networking/team.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 89b02fbfc2eb..be65ee509669 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -108,6 +108,7 @@ Contents: switchdev tc-actions-env-rules tcp-thin + team .. only:: subproject and html diff --git a/Documentation/networking/team.rst b/Documentation/networking/team.rst new file mode 100644 index 000000000000..0a7f3a059586 --- /dev/null +++ b/Documentation/networking/team.rst @@ -0,0 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==== +Team +==== + +Team devices are driven from userspace via libteam library which is here: + https://github.com/jpirko/libteam diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.txt deleted file mode 100644 index 5a013686b9ea..000000000000 --- a/Documentation/networking/team.txt +++ /dev/null @@ -1,2 +0,0 @@ -Team devices are driven from userspace via libteam library which is here: - https://github.com/jpirko/libteam -- cgit v1.2.3 From 06bfa47e72c83550fefc93c62a1ace5fff72e212 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:31 +0200 Subject: docs: networking: convert timestamping.txt to ReST - add SPDX header; - add a document title; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/packet_mmap.rst | 4 +- Documentation/networking/timestamping.rst | 591 ++++++++++++++++++++++++++++++ Documentation/networking/timestamping.txt | 571 ----------------------------- include/uapi/linux/errqueue.h | 2 +- 5 files changed, 595 insertions(+), 574 deletions(-) create mode 100644 Documentation/networking/timestamping.rst delete mode 100644 Documentation/networking/timestamping.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index be65ee509669..8f9a84b8e3f2 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -109,6 +109,7 @@ Contents: tc-actions-env-rules tcp-thin team + timestamping .. only:: subproject and html diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst index 884c7222b9e9..6c009ceb1183 100644 --- a/Documentation/networking/packet_mmap.rst +++ b/Documentation/networking/packet_mmap.rst @@ -1030,7 +1030,7 @@ the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your NIC is capable of timestamping packets in hardware, you can request those hardware timestamps to be used. Note: you may need to enable the generation of hardware timestamps with SIOCSHWTSTAMP (see related information from -Documentation/networking/timestamping.txt). +Documentation/networking/timestamping.rst). PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:: @@ -1069,7 +1069,7 @@ TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec members do not contain a valid value. For TX_RINGs, by default no timestamp is generated! -See include/linux/net_tstamp.h and Documentation/networking/timestamping.txt +See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst for more information on hardware timestamps. Miscellaneous bits diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst new file mode 100644 index 000000000000..1adead6a4527 --- /dev/null +++ b/Documentation/networking/timestamping.rst @@ -0,0 +1,591 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +Timestamping +============ + + +1. Control Interfaces +===================== + +The interfaces for receiving network packages timestamps are: + +SO_TIMESTAMP + Generates a timestamp for each incoming packet in (not necessarily + monotonic) system time. Reports the timestamp via recvmsg() in a + control message in usec resolution. + SO_TIMESTAMP is defined as SO_TIMESTAMP_NEW or SO_TIMESTAMP_OLD + based on the architecture type and time_t representation of libc. + Control message format is in struct __kernel_old_timeval for + SO_TIMESTAMP_OLD and in struct __kernel_sock_timeval for + SO_TIMESTAMP_NEW options respectively. + +SO_TIMESTAMPNS + Same timestamping mechanism as SO_TIMESTAMP, but reports the + timestamp as struct timespec in nsec resolution. + SO_TIMESTAMPNS is defined as SO_TIMESTAMPNS_NEW or SO_TIMESTAMPNS_OLD + based on the architecture type and time_t representation of libc. + Control message format is in struct timespec for SO_TIMESTAMPNS_OLD + and in struct __kernel_timespec for SO_TIMESTAMPNS_NEW options + respectively. + +IP_MULTICAST_LOOP + SO_TIMESTAMP[NS] + Only for multicast:approximate transmit timestamp obtained by + reading the looped packet receive timestamp. + +SO_TIMESTAMPING + Generates timestamps on reception, transmission or both. Supports + multiple timestamp sources, including hardware. Supports generating + timestamps for stream sockets. + + +1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW) +------------------------------------------------------------- + +This socket option enables timestamping of datagrams on the reception +path. Because the destination socket, if any, is not known early in +the network stack, the feature has to be enabled for all packets. The +same is true for all early receive timestamp options. + +For interface details, see `man 7 socket`. + +Always use SO_TIMESTAMP_NEW timestamp to always get timestamp in +struct __kernel_sock_timeval format. + +SO_TIMESTAMP_OLD returns incorrect timestamps after the year 2038 +on 32 bit machines. + +1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW): + +This option is identical to SO_TIMESTAMP except for the returned data type. +Its struct timespec allows for higher resolution (ns) timestamps than the +timeval of SO_TIMESTAMP (ms). + +Always use SO_TIMESTAMPNS_NEW timestamp to always get timestamp in +struct __kernel_timespec format. + +SO_TIMESTAMPNS_OLD returns incorrect timestamps after the year 2038 +on 32 bit machines. + +1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW) +---------------------------------------------------------------------- + +Supports multiple types of timestamp requests. As a result, this +socket option takes a bitmap of flags, not a boolean. In:: + + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); + +val is an integer with any of the following bits set. Setting other +bit returns EINVAL and does not change the current state. + +The socket option configures timestamp generation for individual +sk_buffs (1.3.1), timestamp reporting to the socket's error +queue (1.3.2) and options (1.3.3). Timestamp generation can also +be enabled for individual sendmsg calls using cmsg (1.3.4). + + +1.3.1 Timestamp Generation +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Some bits are requests to the stack to try to generate timestamps. Any +combination of them is valid. Changes to these bits apply to newly +created packets, not to packets already in the stack. As a result, it +is possible to selectively request timestamps for a subset of packets +(e.g., for sampling) by embedding an send() call within two setsockopt +calls, one to enable timestamp generation and one to disable it. +Timestamps may also be generated for reasons other than being +requested by a particular socket, such as when receive timestamping is +enabled system wide, as explained earlier. + +SOF_TIMESTAMPING_RX_HARDWARE: + Request rx timestamps generated by the network adapter. + +SOF_TIMESTAMPING_RX_SOFTWARE: + Request rx timestamps when data enters the kernel. These timestamps + are generated just after a device driver hands a packet to the + kernel receive stack. + +SOF_TIMESTAMPING_TX_HARDWARE: + Request tx timestamps generated by the network adapter. This flag + can be enabled via both socket options and control messages. + +SOF_TIMESTAMPING_TX_SOFTWARE: + Request tx timestamps when data leaves the kernel. These timestamps + are generated in the device driver as close as possible, but always + prior to, passing the packet to the network interface. Hence, they + require driver support and may not be available for all devices. + This flag can be enabled via both socket options and control messages. + +SOF_TIMESTAMPING_TX_SCHED: + Request tx timestamps prior to entering the packet scheduler. Kernel + transmit latency is, if long, often dominated by queuing delay. The + difference between this timestamp and one taken at + SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent + of protocol processing. The latency incurred in protocol + processing, if any, can be computed by subtracting a userspace + timestamp taken immediately before send() from this timestamp. On + machines with virtual devices where a transmitted packet travels + through multiple devices and, hence, multiple packet schedulers, + a timestamp is generated at each layer. This allows for fine + grained measurement of queuing delay. This flag can be enabled + via both socket options and control messages. + +SOF_TIMESTAMPING_TX_ACK: + Request tx timestamps when all data in the send buffer has been + acknowledged. This only makes sense for reliable protocols. It is + currently only implemented for TCP. For that protocol, it may + over-report measurement, because the timestamp is generated when all + data up to and including the buffer at send() was acknowledged: the + cumulative acknowledgment. The mechanism ignores SACK and FACK. + This flag can be enabled via both socket options and control messages. + + +1.3.2 Timestamp Reporting +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The other three bits control which timestamps will be reported in a +generated control message. Changes to the bits take immediate +effect at the timestamp reporting locations in the stack. Timestamps +are only reported for packets that also have the relevant timestamp +generation request set. + +SOF_TIMESTAMPING_SOFTWARE: + Report any software timestamps when available. + +SOF_TIMESTAMPING_SYS_HARDWARE: + This option is deprecated and ignored. + +SOF_TIMESTAMPING_RAW_HARDWARE: + Report hardware timestamps as generated by + SOF_TIMESTAMPING_TX_HARDWARE when available. + + +1.3.3 Timestamp Options +^^^^^^^^^^^^^^^^^^^^^^^ + +The interface supports the options + +SOF_TIMESTAMPING_OPT_ID: + Generate a unique identifier along with each packet. A process can + have multiple concurrent timestamping requests outstanding. Packets + can be reordered in the transmit path, for instance in the packet + scheduler. In that case timestamps will be queued onto the error + queue out of order from the original send() calls. It is not always + possible to uniquely match timestamps to the original send() calls + based on timestamp order or payload inspection alone, then. + + This option associates each packet at send() with a unique + identifier and returns that along with the timestamp. The identifier + is derived from a per-socket u32 counter (that wraps). For datagram + sockets, the counter increments with each sent packet. For stream + sockets, it increments with every byte. + + The counter starts at zero. It is initialized the first time that + the socket option is enabled. It is reset each time the option is + enabled after having been disabled. Resetting the counter does not + change the identifiers of existing packets in the system. + + This option is implemented only for transmit timestamps. There, the + timestamp is always looped along with a struct sock_extended_err. + The option modifies field ee_data to pass an id that is unique + among all possibly concurrently outstanding timestamp requests for + that socket. + + +SOF_TIMESTAMPING_OPT_CMSG: + Support recv() cmsg for all timestamped packets. Control messages + are already supported unconditionally on all packets with receive + timestamps and on IPv6 packets with transmit timestamp. This option + extends them to IPv4 packets with transmit timestamp. One use case + is to correlate packets with their egress device, by enabling socket + option IP_PKTINFO simultaneously. + + +SOF_TIMESTAMPING_OPT_TSONLY: + Applies to transmit timestamps only. Makes the kernel return the + timestamp as a cmsg alongside an empty packet, as opposed to + alongside the original packet. This reduces the amount of memory + charged to the socket's receive budget (SO_RCVBUF) and delivers + the timestamp even if sysctl net.core.tstamp_allow_data is 0. + This option disables SOF_TIMESTAMPING_OPT_CMSG. + +SOF_TIMESTAMPING_OPT_STATS: + Optional stats that are obtained along with the transmit timestamps. + It must be used together with SOF_TIMESTAMPING_OPT_TSONLY. When the + transmit timestamp is available, the stats are available in a + separate control message of type SCM_TIMESTAMPING_OPT_STATS, as a + list of TLVs (struct nlattr) of types. These stats allow the + application to associate various transport layer stats with + the transmit timestamps, such as how long a certain block of + data was limited by peer's receiver window. + +SOF_TIMESTAMPING_OPT_PKTINFO: + Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming + packets with hardware timestamps. The message contains struct + scm_ts_pktinfo, which supplies the index of the real interface which + received the packet and its length at layer 2. A valid (non-zero) + interface index will be returned only if CONFIG_NET_RX_BUSY_POLL is + enabled and the driver is using NAPI. The struct contains also two + other fields, but they are reserved and undefined. + +SOF_TIMESTAMPING_OPT_TX_SWHW: + Request both hardware and software timestamps for outgoing packets + when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE + are enabled at the same time. If both timestamps are generated, + two separate messages will be looped to the socket's error queue, + each containing just one timestamp. + +New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to +disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate +regardless of the setting of sysctl net.core.tstamp_allow_data. + +An exception is when a process needs additional cmsg data, for +instance SOL_IP/IP_PKTINFO to detect the egress network interface. +Then pass option SOF_TIMESTAMPING_OPT_CMSG. This option depends on +having access to the contents of the original packet, so cannot be +combined with SOF_TIMESTAMPING_OPT_TSONLY. + + +1.3.4. Enabling timestamps via control messages +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In addition to socket options, timestamp generation can be requested +per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1). +Using this feature, applications can sample timestamps per sendmsg() +without paying the overhead of enabling and disabling timestamps via +setsockopt:: + + struct msghdr *msg; + ... + cmsg = CMSG_FIRSTHDR(msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SO_TIMESTAMPING; + cmsg->cmsg_len = CMSG_LEN(sizeof(__u32)); + *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED | + SOF_TIMESTAMPING_TX_SOFTWARE | + SOF_TIMESTAMPING_TX_ACK; + err = sendmsg(fd, msg, 0); + +The SOF_TIMESTAMPING_TX_* flags set via cmsg will override +the SOF_TIMESTAMPING_TX_* flags set via setsockopt. + +Moreover, applications must still enable timestamp reporting via +setsockopt to receive timestamps:: + + __u32 val = SOF_TIMESTAMPING_SOFTWARE | + SOF_TIMESTAMPING_OPT_ID /* or any other flag */; + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); + + +1.4 Bytestream Timestamps +------------------------- + +The SO_TIMESTAMPING interface supports timestamping of bytes in a +bytestream. Each request is interpreted as a request for when the +entire contents of the buffer has passed a timestamping point. That +is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record +when all bytes have reached the device driver, regardless of how +many packets the data has been converted into. + +In general, bytestreams have no natural delimiters and therefore +correlating a timestamp with data is non-trivial. A range of bytes +may be split across segments, any segments may be merged (possibly +coalescing sections of previously segmented buffers associated with +independent send() calls). Segments can be reordered and the same +byte range can coexist in multiple segments for protocols that +implement retransmissions. + +It is essential that all timestamps implement the same semantics, +regardless of these possible transformations, as otherwise they are +incomparable. Handling "rare" corner cases differently from the +simple case (a 1:1 mapping from buffer to skb) is insufficient +because performance debugging often needs to focus on such outliers. + +In practice, timestamps can be correlated with segments of a +bytestream consistently, if both semantics of the timestamp and the +timing of measurement are chosen correctly. This challenge is no +different from deciding on a strategy for IP fragmentation. There, the +definition is that only the first fragment is timestamped. For +bytestreams, we chose that a timestamp is generated only when all +bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to +implement and reason about. An implementation that has to take into +account SACK would be more complex due to possible transmission holes +and out of order arrival. + +On the host, TCP can also break the simple 1:1 mapping from buffer to +skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The +implementation ensures correctness in all cases by tracking the +individual last byte passed to send(), even if it is no longer the +last byte after an skbuff extend or merge operation. It stores the +relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff +has only one such field, only one timestamp can be generated. + +In rare cases, a timestamp request can be missed if two requests are +collapsed onto the same skb. A process can detect this situation by +enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at +send time with the value returned for each timestamp. It can prevent +the situation by always flushing the TCP stack in between requests, +for instance by enabling TCP_NODELAY and disabling TCP_CORK and +autocork. + +These precautions ensure that the timestamp is generated only when all +bytes have passed a timestamp point, assuming that the network stack +itself does not reorder the segments. The stack indeed tries to avoid +reordering. The one exception is under administrator control: it is +possible to construct a packet scheduler configuration that delays +segments from the same stream differently. Such a setup would be +unusual. + + +2 Data Interfaces +================== + +Timestamps are read using the ancillary data feature of recvmsg(). +See `man 3 cmsg` for details of this interface. The socket manual +page (`man 7 socket`) describes how timestamps generated with +SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved. + + +2.1 SCM_TIMESTAMPING records +---------------------------- + +These timestamps are returned in a control message with cmsg_level +SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type + +For SO_TIMESTAMPING_OLD:: + + struct scm_timestamping { + struct timespec ts[3]; + }; + +For SO_TIMESTAMPING_NEW:: + + struct scm_timestamping64 { + struct __kernel_timespec ts[3]; + +Always use SO_TIMESTAMPING_NEW timestamp to always get timestamp in +struct scm_timestamping64 format. + +SO_TIMESTAMPING_OLD returns incorrect timestamps after the year 2038 +on 32 bit machines. + +The structure can return up to three timestamps. This is a legacy +feature. At least one field is non-zero at any time. Most timestamps +are passed in ts[0]. Hardware timestamps are passed in ts[2]. + +ts[1] used to hold hardware timestamps converted to system time. +Instead, expose the hardware clock device on the NIC directly as +a HW PTP clock source, to allow time conversion in userspace and +optionally synchronize system time with a userspace PTP stack such +as linuxptp. For the PTP clock API, see Documentation/driver-api/ptp.rst. + +Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled +together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false +software timestamp will be generated in the recvmsg() call and passed +in ts[0] when a real software timestamp is missing. This happens also +on hardware transmit timestamps. + +2.1.1 Transmit timestamps with MSG_ERRQUEUE +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For transmit timestamps the outgoing packet is looped back to the +socket's error queue with the send timestamp(s) attached. A process +receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE +set and with a msg_control buffer sufficiently large to receive the +relevant metadata structures. The recvmsg call returns the original +outgoing data packet with two ancillary messages attached. + +A message of cm_level SOL_IP(V6) and cm_type IP(V6)_RECVERR +embeds a struct sock_extended_err. This defines the error type. For +timestamps, the ee_errno field is ENOMSG. The other ancillary message +will have cm_level SOL_SOCKET and cm_type SCM_TIMESTAMPING. This +embeds the struct scm_timestamping. + + +2.1.1.2 Timestamp types +~~~~~~~~~~~~~~~~~~~~~~~ + +The semantics of the three struct timespec are defined by field +ee_info in the extended error structure. It contains a value of +type SCM_TSTAMP_* to define the actual timestamp passed in +scm_timestamping. + +The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_* +control fields discussed previously, with one exception. For legacy +reasons, SCM_TSTAMP_SND is equal to zero and can be set for both +SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It +is the first if ts[2] is non-zero, the second otherwise, in which +case the timestamp is stored in ts[0]. + + +2.1.1.3 Fragmentation +~~~~~~~~~~~~~~~~~~~~~ + +Fragmentation of outgoing datagrams is rare, but is possible, e.g., by +explicitly disabling PMTU discovery. If an outgoing packet is fragmented, +then only the first fragment is timestamped and returned to the sending +socket. + + +2.1.1.4 Packet Payload +~~~~~~~~~~~~~~~~~~~~~~ + +The calling application is often not interested in receiving the whole +packet payload that it passed to the stack originally: the socket +error queue mechanism is just a method to piggyback the timestamp on. +In this case, the application can choose to read datagrams with a +smaller buffer, possibly even of length 0. The payload is truncated +accordingly. Until the process calls recvmsg() on the error queue, +however, the full packet is queued, taking up budget from SO_RCVBUF. + + +2.1.1.5 Blocking Read +~~~~~~~~~~~~~~~~~~~~~ + +Reading from the error queue is always a non-blocking operation. To +block waiting on a timestamp, use poll or select. poll() will return +POLLERR in pollfd.revents if any data is ready on the error queue. +There is no need to pass this flag in pollfd.events. This flag is +ignored on request. See also `man 2 poll`. + + +2.1.2 Receive timestamps +^^^^^^^^^^^^^^^^^^^^^^^^ + +On reception, there is no reason to read from the socket error queue. +The SCM_TIMESTAMPING ancillary data is sent along with the packet data +on a normal recvmsg(). Since this is not a socket error, it is not +accompanied by a message SOL_IP(V6)/IP(V6)_RECVERROR. In this case, +the meaning of the three fields in struct scm_timestamping is +implicitly defined. ts[0] holds a software timestamp if set, ts[1] +is again deprecated and ts[2] holds a hardware timestamp if set. + + +3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP +======================================================================= + +Hardware time stamping must also be initialized for each device driver +that is expected to do hardware time stamping. The parameter is defined in +include/uapi/linux/net_tstamp.h as:: + + struct hwtstamp_config { + int flags; /* no flags defined right now, must be zero */ + int tx_type; /* HWTSTAMP_TX_* */ + int rx_filter; /* HWTSTAMP_FILTER_* */ + }; + +Desired behavior is passed into the kernel and to a specific device by +calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose +ifr_data points to a struct hwtstamp_config. The tx_type and +rx_filter are hints to the driver what it is expected to do. If +the requested fine-grained filtering for incoming packets is not +supported, the driver may time stamp more than just the requested types +of packets. + +Drivers are free to use a more permissive configuration than the requested +configuration. It is expected that drivers should only implement directly the +most generic mode that can be supported. For example if the hardware can +support HWTSTAMP_FILTER_V2_EVENT, then it should generally always upscale +HWTSTAMP_FILTER_V2_L2_SYNC_MESSAGE, and so forth, as HWTSTAMP_FILTER_V2_EVENT +is more generic (and more useful to applications). + +A driver which supports hardware time stamping shall update the struct +with the actual, possibly more permissive configuration. If the +requested packets cannot be time stamped, then nothing should be +changed and ERANGE shall be returned (in contrast to EINVAL, which +indicates that SIOCSHWTSTAMP is not supported at all). + +Only a processes with admin rights may change the configuration. User +space is responsible to ensure that multiple processes don't interfere +with each other and that the settings are reset. + +Any process can read the actual configuration by passing this +structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has +not been implemented in all drivers. + +:: + + /* possible values for hwtstamp_config->tx_type */ + enum { + /* + * no outgoing packet will need hardware time stamping; + * should a packet arrive which asks for it, no hardware + * time stamping will be done + */ + HWTSTAMP_TX_OFF, + + /* + * enables hardware time stamping for outgoing packets; + * the sender of the packet decides which are to be + * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE + * before sending the packet + */ + HWTSTAMP_TX_ON, + }; + + /* possible values for hwtstamp_config->rx_filter */ + enum { + /* time stamp no incoming packet at all */ + HWTSTAMP_FILTER_NONE, + + /* time stamp any incoming packet */ + HWTSTAMP_FILTER_ALL, + + /* return value: time stamp all packets requested plus some others */ + HWTSTAMP_FILTER_SOME, + + /* PTP v1, UDP, any kind of event packet */ + HWTSTAMP_FILTER_PTP_V1_L4_EVENT, + + /* for the complete list of values, please check + * the include file include/uapi/linux/net_tstamp.h + */ + }; + +3.1 Hardware Timestamping Implementation: Device Drivers +-------------------------------------------------------- + +A driver which supports hardware time stamping must support the +SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with +the actual values as described in the section on SIOCSHWTSTAMP. It +should also support SIOCGHWTSTAMP. + +Time stamps for received packets must be stored in the skb. To get a pointer +to the shared time stamp structure of the skb call skb_hwtstamps(). Then +set the time stamps in the structure:: + + struct skb_shared_hwtstamps { + /* hardware time stamp transformed into duration + * since arbitrary point in time + */ + ktime_t hwtstamp; + }; + +Time stamps for outgoing packets are to be generated as follows: + +- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) + is set no-zero. If yes, then the driver is expected to do hardware time + stamping. +- If this is possible for the skb and requested, then declare + that the driver is doing the time stamping by setting the flag + SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with:: + + skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; + + You might want to keep a pointer to the associated skb for the next step + and not free the skb. A driver not supporting hardware time stamping doesn't + do that. A driver must never touch sk_buff::tstamp! It is used to store + software generated time stamps by the network subsystem. +- Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware + as possible. skb_tx_timestamp() provides a software time stamp if requested + and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set). +- As soon as the driver has sent the packet and/or obtained a + hardware time stamp for it, it passes the time stamp back by + calling skb_hwtstamp_tx() with the original skb, the raw + hardware time stamp. skb_hwtstamp_tx() clones the original skb and + adds the timestamps, therefore the original skb has to be freed now. + If obtaining the hardware time stamp somehow fails, then the driver + should not fall back to software time stamping. The rationale is that + this would occur at a later time in the processing pipeline than other + software time stamping and therefore could lead to unexpected deltas + between time stamps. diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt deleted file mode 100644 index 8dd6333c3270..000000000000 --- a/Documentation/networking/timestamping.txt +++ /dev/null @@ -1,571 +0,0 @@ - -1. Control Interfaces - -The interfaces for receiving network packages timestamps are: - -* SO_TIMESTAMP - Generates a timestamp for each incoming packet in (not necessarily - monotonic) system time. Reports the timestamp via recvmsg() in a - control message in usec resolution. - SO_TIMESTAMP is defined as SO_TIMESTAMP_NEW or SO_TIMESTAMP_OLD - based on the architecture type and time_t representation of libc. - Control message format is in struct __kernel_old_timeval for - SO_TIMESTAMP_OLD and in struct __kernel_sock_timeval for - SO_TIMESTAMP_NEW options respectively. - -* SO_TIMESTAMPNS - Same timestamping mechanism as SO_TIMESTAMP, but reports the - timestamp as struct timespec in nsec resolution. - SO_TIMESTAMPNS is defined as SO_TIMESTAMPNS_NEW or SO_TIMESTAMPNS_OLD - based on the architecture type and time_t representation of libc. - Control message format is in struct timespec for SO_TIMESTAMPNS_OLD - and in struct __kernel_timespec for SO_TIMESTAMPNS_NEW options - respectively. - -* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS] - Only for multicast:approximate transmit timestamp obtained by - reading the looped packet receive timestamp. - -* SO_TIMESTAMPING - Generates timestamps on reception, transmission or both. Supports - multiple timestamp sources, including hardware. Supports generating - timestamps for stream sockets. - - -1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW): - -This socket option enables timestamping of datagrams on the reception -path. Because the destination socket, if any, is not known early in -the network stack, the feature has to be enabled for all packets. The -same is true for all early receive timestamp options. - -For interface details, see `man 7 socket`. - -Always use SO_TIMESTAMP_NEW timestamp to always get timestamp in -struct __kernel_sock_timeval format. - -SO_TIMESTAMP_OLD returns incorrect timestamps after the year 2038 -on 32 bit machines. - -1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW): - -This option is identical to SO_TIMESTAMP except for the returned data type. -Its struct timespec allows for higher resolution (ns) timestamps than the -timeval of SO_TIMESTAMP (ms). - -Always use SO_TIMESTAMPNS_NEW timestamp to always get timestamp in -struct __kernel_timespec format. - -SO_TIMESTAMPNS_OLD returns incorrect timestamps after the year 2038 -on 32 bit machines. - -1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW): - -Supports multiple types of timestamp requests. As a result, this -socket option takes a bitmap of flags, not a boolean. In - - err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); - -val is an integer with any of the following bits set. Setting other -bit returns EINVAL and does not change the current state. - -The socket option configures timestamp generation for individual -sk_buffs (1.3.1), timestamp reporting to the socket's error -queue (1.3.2) and options (1.3.3). Timestamp generation can also -be enabled for individual sendmsg calls using cmsg (1.3.4). - - -1.3.1 Timestamp Generation - -Some bits are requests to the stack to try to generate timestamps. Any -combination of them is valid. Changes to these bits apply to newly -created packets, not to packets already in the stack. As a result, it -is possible to selectively request timestamps for a subset of packets -(e.g., for sampling) by embedding an send() call within two setsockopt -calls, one to enable timestamp generation and one to disable it. -Timestamps may also be generated for reasons other than being -requested by a particular socket, such as when receive timestamping is -enabled system wide, as explained earlier. - -SOF_TIMESTAMPING_RX_HARDWARE: - Request rx timestamps generated by the network adapter. - -SOF_TIMESTAMPING_RX_SOFTWARE: - Request rx timestamps when data enters the kernel. These timestamps - are generated just after a device driver hands a packet to the - kernel receive stack. - -SOF_TIMESTAMPING_TX_HARDWARE: - Request tx timestamps generated by the network adapter. This flag - can be enabled via both socket options and control messages. - -SOF_TIMESTAMPING_TX_SOFTWARE: - Request tx timestamps when data leaves the kernel. These timestamps - are generated in the device driver as close as possible, but always - prior to, passing the packet to the network interface. Hence, they - require driver support and may not be available for all devices. - This flag can be enabled via both socket options and control messages. - - -SOF_TIMESTAMPING_TX_SCHED: - Request tx timestamps prior to entering the packet scheduler. Kernel - transmit latency is, if long, often dominated by queuing delay. The - difference between this timestamp and one taken at - SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent - of protocol processing. The latency incurred in protocol - processing, if any, can be computed by subtracting a userspace - timestamp taken immediately before send() from this timestamp. On - machines with virtual devices where a transmitted packet travels - through multiple devices and, hence, multiple packet schedulers, - a timestamp is generated at each layer. This allows for fine - grained measurement of queuing delay. This flag can be enabled - via both socket options and control messages. - -SOF_TIMESTAMPING_TX_ACK: - Request tx timestamps when all data in the send buffer has been - acknowledged. This only makes sense for reliable protocols. It is - currently only implemented for TCP. For that protocol, it may - over-report measurement, because the timestamp is generated when all - data up to and including the buffer at send() was acknowledged: the - cumulative acknowledgment. The mechanism ignores SACK and FACK. - This flag can be enabled via both socket options and control messages. - - -1.3.2 Timestamp Reporting - -The other three bits control which timestamps will be reported in a -generated control message. Changes to the bits take immediate -effect at the timestamp reporting locations in the stack. Timestamps -are only reported for packets that also have the relevant timestamp -generation request set. - -SOF_TIMESTAMPING_SOFTWARE: - Report any software timestamps when available. - -SOF_TIMESTAMPING_SYS_HARDWARE: - This option is deprecated and ignored. - -SOF_TIMESTAMPING_RAW_HARDWARE: - Report hardware timestamps as generated by - SOF_TIMESTAMPING_TX_HARDWARE when available. - - -1.3.3 Timestamp Options - -The interface supports the options - -SOF_TIMESTAMPING_OPT_ID: - - Generate a unique identifier along with each packet. A process can - have multiple concurrent timestamping requests outstanding. Packets - can be reordered in the transmit path, for instance in the packet - scheduler. In that case timestamps will be queued onto the error - queue out of order from the original send() calls. It is not always - possible to uniquely match timestamps to the original send() calls - based on timestamp order or payload inspection alone, then. - - This option associates each packet at send() with a unique - identifier and returns that along with the timestamp. The identifier - is derived from a per-socket u32 counter (that wraps). For datagram - sockets, the counter increments with each sent packet. For stream - sockets, it increments with every byte. - - The counter starts at zero. It is initialized the first time that - the socket option is enabled. It is reset each time the option is - enabled after having been disabled. Resetting the counter does not - change the identifiers of existing packets in the system. - - This option is implemented only for transmit timestamps. There, the - timestamp is always looped along with a struct sock_extended_err. - The option modifies field ee_data to pass an id that is unique - among all possibly concurrently outstanding timestamp requests for - that socket. - - -SOF_TIMESTAMPING_OPT_CMSG: - - Support recv() cmsg for all timestamped packets. Control messages - are already supported unconditionally on all packets with receive - timestamps and on IPv6 packets with transmit timestamp. This option - extends them to IPv4 packets with transmit timestamp. One use case - is to correlate packets with their egress device, by enabling socket - option IP_PKTINFO simultaneously. - - -SOF_TIMESTAMPING_OPT_TSONLY: - - Applies to transmit timestamps only. Makes the kernel return the - timestamp as a cmsg alongside an empty packet, as opposed to - alongside the original packet. This reduces the amount of memory - charged to the socket's receive budget (SO_RCVBUF) and delivers - the timestamp even if sysctl net.core.tstamp_allow_data is 0. - This option disables SOF_TIMESTAMPING_OPT_CMSG. - -SOF_TIMESTAMPING_OPT_STATS: - - Optional stats that are obtained along with the transmit timestamps. - It must be used together with SOF_TIMESTAMPING_OPT_TSONLY. When the - transmit timestamp is available, the stats are available in a - separate control message of type SCM_TIMESTAMPING_OPT_STATS, as a - list of TLVs (struct nlattr) of types. These stats allow the - application to associate various transport layer stats with - the transmit timestamps, such as how long a certain block of - data was limited by peer's receiver window. - -SOF_TIMESTAMPING_OPT_PKTINFO: - - Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming - packets with hardware timestamps. The message contains struct - scm_ts_pktinfo, which supplies the index of the real interface which - received the packet and its length at layer 2. A valid (non-zero) - interface index will be returned only if CONFIG_NET_RX_BUSY_POLL is - enabled and the driver is using NAPI. The struct contains also two - other fields, but they are reserved and undefined. - -SOF_TIMESTAMPING_OPT_TX_SWHW: - - Request both hardware and software timestamps for outgoing packets - when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE - are enabled at the same time. If both timestamps are generated, - two separate messages will be looped to the socket's error queue, - each containing just one timestamp. - -New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to -disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate -regardless of the setting of sysctl net.core.tstamp_allow_data. - -An exception is when a process needs additional cmsg data, for -instance SOL_IP/IP_PKTINFO to detect the egress network interface. -Then pass option SOF_TIMESTAMPING_OPT_CMSG. This option depends on -having access to the contents of the original packet, so cannot be -combined with SOF_TIMESTAMPING_OPT_TSONLY. - - -1.3.4. Enabling timestamps via control messages - -In addition to socket options, timestamp generation can be requested -per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1). -Using this feature, applications can sample timestamps per sendmsg() -without paying the overhead of enabling and disabling timestamps via -setsockopt: - - struct msghdr *msg; - ... - cmsg = CMSG_FIRSTHDR(msg); - cmsg->cmsg_level = SOL_SOCKET; - cmsg->cmsg_type = SO_TIMESTAMPING; - cmsg->cmsg_len = CMSG_LEN(sizeof(__u32)); - *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED | - SOF_TIMESTAMPING_TX_SOFTWARE | - SOF_TIMESTAMPING_TX_ACK; - err = sendmsg(fd, msg, 0); - -The SOF_TIMESTAMPING_TX_* flags set via cmsg will override -the SOF_TIMESTAMPING_TX_* flags set via setsockopt. - -Moreover, applications must still enable timestamp reporting via -setsockopt to receive timestamps: - - __u32 val = SOF_TIMESTAMPING_SOFTWARE | - SOF_TIMESTAMPING_OPT_ID /* or any other flag */; - err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); - - -1.4 Bytestream Timestamps - -The SO_TIMESTAMPING interface supports timestamping of bytes in a -bytestream. Each request is interpreted as a request for when the -entire contents of the buffer has passed a timestamping point. That -is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record -when all bytes have reached the device driver, regardless of how -many packets the data has been converted into. - -In general, bytestreams have no natural delimiters and therefore -correlating a timestamp with data is non-trivial. A range of bytes -may be split across segments, any segments may be merged (possibly -coalescing sections of previously segmented buffers associated with -independent send() calls). Segments can be reordered and the same -byte range can coexist in multiple segments for protocols that -implement retransmissions. - -It is essential that all timestamps implement the same semantics, -regardless of these possible transformations, as otherwise they are -incomparable. Handling "rare" corner cases differently from the -simple case (a 1:1 mapping from buffer to skb) is insufficient -because performance debugging often needs to focus on such outliers. - -In practice, timestamps can be correlated with segments of a -bytestream consistently, if both semantics of the timestamp and the -timing of measurement are chosen correctly. This challenge is no -different from deciding on a strategy for IP fragmentation. There, the -definition is that only the first fragment is timestamped. For -bytestreams, we chose that a timestamp is generated only when all -bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to -implement and reason about. An implementation that has to take into -account SACK would be more complex due to possible transmission holes -and out of order arrival. - -On the host, TCP can also break the simple 1:1 mapping from buffer to -skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The -implementation ensures correctness in all cases by tracking the -individual last byte passed to send(), even if it is no longer the -last byte after an skbuff extend or merge operation. It stores the -relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff -has only one such field, only one timestamp can be generated. - -In rare cases, a timestamp request can be missed if two requests are -collapsed onto the same skb. A process can detect this situation by -enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at -send time with the value returned for each timestamp. It can prevent -the situation by always flushing the TCP stack in between requests, -for instance by enabling TCP_NODELAY and disabling TCP_CORK and -autocork. - -These precautions ensure that the timestamp is generated only when all -bytes have passed a timestamp point, assuming that the network stack -itself does not reorder the segments. The stack indeed tries to avoid -reordering. The one exception is under administrator control: it is -possible to construct a packet scheduler configuration that delays -segments from the same stream differently. Such a setup would be -unusual. - - -2 Data Interfaces - -Timestamps are read using the ancillary data feature of recvmsg(). -See `man 3 cmsg` for details of this interface. The socket manual -page (`man 7 socket`) describes how timestamps generated with -SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved. - - -2.1 SCM_TIMESTAMPING records - -These timestamps are returned in a control message with cmsg_level -SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type - -For SO_TIMESTAMPING_OLD: - -struct scm_timestamping { - struct timespec ts[3]; -}; - -For SO_TIMESTAMPING_NEW: - -struct scm_timestamping64 { - struct __kernel_timespec ts[3]; - -Always use SO_TIMESTAMPING_NEW timestamp to always get timestamp in -struct scm_timestamping64 format. - -SO_TIMESTAMPING_OLD returns incorrect timestamps after the year 2038 -on 32 bit machines. - -The structure can return up to three timestamps. This is a legacy -feature. At least one field is non-zero at any time. Most timestamps -are passed in ts[0]. Hardware timestamps are passed in ts[2]. - -ts[1] used to hold hardware timestamps converted to system time. -Instead, expose the hardware clock device on the NIC directly as -a HW PTP clock source, to allow time conversion in userspace and -optionally synchronize system time with a userspace PTP stack such -as linuxptp. For the PTP clock API, see Documentation/driver-api/ptp.rst. - -Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled -together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false -software timestamp will be generated in the recvmsg() call and passed -in ts[0] when a real software timestamp is missing. This happens also -on hardware transmit timestamps. - -2.1.1 Transmit timestamps with MSG_ERRQUEUE - -For transmit timestamps the outgoing packet is looped back to the -socket's error queue with the send timestamp(s) attached. A process -receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE -set and with a msg_control buffer sufficiently large to receive the -relevant metadata structures. The recvmsg call returns the original -outgoing data packet with two ancillary messages attached. - -A message of cm_level SOL_IP(V6) and cm_type IP(V6)_RECVERR -embeds a struct sock_extended_err. This defines the error type. For -timestamps, the ee_errno field is ENOMSG. The other ancillary message -will have cm_level SOL_SOCKET and cm_type SCM_TIMESTAMPING. This -embeds the struct scm_timestamping. - - -2.1.1.2 Timestamp types - -The semantics of the three struct timespec are defined by field -ee_info in the extended error structure. It contains a value of -type SCM_TSTAMP_* to define the actual timestamp passed in -scm_timestamping. - -The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_* -control fields discussed previously, with one exception. For legacy -reasons, SCM_TSTAMP_SND is equal to zero and can be set for both -SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It -is the first if ts[2] is non-zero, the second otherwise, in which -case the timestamp is stored in ts[0]. - - -2.1.1.3 Fragmentation - -Fragmentation of outgoing datagrams is rare, but is possible, e.g., by -explicitly disabling PMTU discovery. If an outgoing packet is fragmented, -then only the first fragment is timestamped and returned to the sending -socket. - - -2.1.1.4 Packet Payload - -The calling application is often not interested in receiving the whole -packet payload that it passed to the stack originally: the socket -error queue mechanism is just a method to piggyback the timestamp on. -In this case, the application can choose to read datagrams with a -smaller buffer, possibly even of length 0. The payload is truncated -accordingly. Until the process calls recvmsg() on the error queue, -however, the full packet is queued, taking up budget from SO_RCVBUF. - - -2.1.1.5 Blocking Read - -Reading from the error queue is always a non-blocking operation. To -block waiting on a timestamp, use poll or select. poll() will return -POLLERR in pollfd.revents if any data is ready on the error queue. -There is no need to pass this flag in pollfd.events. This flag is -ignored on request. See also `man 2 poll`. - - -2.1.2 Receive timestamps - -On reception, there is no reason to read from the socket error queue. -The SCM_TIMESTAMPING ancillary data is sent along with the packet data -on a normal recvmsg(). Since this is not a socket error, it is not -accompanied by a message SOL_IP(V6)/IP(V6)_RECVERROR. In this case, -the meaning of the three fields in struct scm_timestamping is -implicitly defined. ts[0] holds a software timestamp if set, ts[1] -is again deprecated and ts[2] holds a hardware timestamp if set. - - -3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP - -Hardware time stamping must also be initialized for each device driver -that is expected to do hardware time stamping. The parameter is defined in -include/uapi/linux/net_tstamp.h as: - -struct hwtstamp_config { - int flags; /* no flags defined right now, must be zero */ - int tx_type; /* HWTSTAMP_TX_* */ - int rx_filter; /* HWTSTAMP_FILTER_* */ -}; - -Desired behavior is passed into the kernel and to a specific device by -calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose -ifr_data points to a struct hwtstamp_config. The tx_type and -rx_filter are hints to the driver what it is expected to do. If -the requested fine-grained filtering for incoming packets is not -supported, the driver may time stamp more than just the requested types -of packets. - -Drivers are free to use a more permissive configuration than the requested -configuration. It is expected that drivers should only implement directly the -most generic mode that can be supported. For example if the hardware can -support HWTSTAMP_FILTER_V2_EVENT, then it should generally always upscale -HWTSTAMP_FILTER_V2_L2_SYNC_MESSAGE, and so forth, as HWTSTAMP_FILTER_V2_EVENT -is more generic (and more useful to applications). - -A driver which supports hardware time stamping shall update the struct -with the actual, possibly more permissive configuration. If the -requested packets cannot be time stamped, then nothing should be -changed and ERANGE shall be returned (in contrast to EINVAL, which -indicates that SIOCSHWTSTAMP is not supported at all). - -Only a processes with admin rights may change the configuration. User -space is responsible to ensure that multiple processes don't interfere -with each other and that the settings are reset. - -Any process can read the actual configuration by passing this -structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has -not been implemented in all drivers. - -/* possible values for hwtstamp_config->tx_type */ -enum { - /* - * no outgoing packet will need hardware time stamping; - * should a packet arrive which asks for it, no hardware - * time stamping will be done - */ - HWTSTAMP_TX_OFF, - - /* - * enables hardware time stamping for outgoing packets; - * the sender of the packet decides which are to be - * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE - * before sending the packet - */ - HWTSTAMP_TX_ON, -}; - -/* possible values for hwtstamp_config->rx_filter */ -enum { - /* time stamp no incoming packet at all */ - HWTSTAMP_FILTER_NONE, - - /* time stamp any incoming packet */ - HWTSTAMP_FILTER_ALL, - - /* return value: time stamp all packets requested plus some others */ - HWTSTAMP_FILTER_SOME, - - /* PTP v1, UDP, any kind of event packet */ - HWTSTAMP_FILTER_PTP_V1_L4_EVENT, - - /* for the complete list of values, please check - * the include file include/uapi/linux/net_tstamp.h - */ -}; - -3.1 Hardware Timestamping Implementation: Device Drivers - -A driver which supports hardware time stamping must support the -SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with -the actual values as described in the section on SIOCSHWTSTAMP. It -should also support SIOCGHWTSTAMP. - -Time stamps for received packets must be stored in the skb. To get a pointer -to the shared time stamp structure of the skb call skb_hwtstamps(). Then -set the time stamps in the structure: - -struct skb_shared_hwtstamps { - /* hardware time stamp transformed into duration - * since arbitrary point in time - */ - ktime_t hwtstamp; -}; - -Time stamps for outgoing packets are to be generated as follows: -- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) - is set no-zero. If yes, then the driver is expected to do hardware time - stamping. -- If this is possible for the skb and requested, then declare - that the driver is doing the time stamping by setting the flag - SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with - - skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; - - You might want to keep a pointer to the associated skb for the next step - and not free the skb. A driver not supporting hardware time stamping doesn't - do that. A driver must never touch sk_buff::tstamp! It is used to store - software generated time stamps by the network subsystem. -- Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware - as possible. skb_tx_timestamp() provides a software time stamp if requested - and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set). -- As soon as the driver has sent the packet and/or obtained a - hardware time stamp for it, it passes the time stamp back by - calling skb_hwtstamp_tx() with the original skb, the raw - hardware time stamp. skb_hwtstamp_tx() clones the original skb and - adds the timestamps, therefore the original skb has to be freed now. - If obtaining the hardware time stamp somehow fails, then the driver - should not fall back to software time stamping. The rationale is that - this would occur at a later time in the processing pipeline than other - software time stamping and therefore could lead to unexpected deltas - between time stamps. diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h index 0cca19670fd2..ca5cb3e3c6df 100644 --- a/include/uapi/linux/errqueue.h +++ b/include/uapi/linux/errqueue.h @@ -36,7 +36,7 @@ struct sock_extended_err { * * The timestamping interfaces SO_TIMESTAMPING, MSG_TSTAMP_* * communicate network timestamps by passing this struct in a cmsg with - * recvmsg(). See Documentation/networking/timestamping.txt for details. + * recvmsg(). See Documentation/networking/timestamping.rst for details. * User space sees a timespec definition that matches either * __kernel_timespec or __kernel_old_timespec, in the kernel we * require two structure definitions to provide both. -- cgit v1.2.3 From 4ac0b122ee63d89b5aaf2e3e376092d8ac02a567 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:32 +0200 Subject: docs: networking: convert tproxy.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/tproxy.rst | 109 ++++++++++++++++++++++++++++++++++++ Documentation/networking/tproxy.txt | 104 ---------------------------------- net/netfilter/Kconfig | 2 +- 4 files changed, 111 insertions(+), 105 deletions(-) create mode 100644 Documentation/networking/tproxy.rst delete mode 100644 Documentation/networking/tproxy.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 8f9a84b8e3f2..b423b2db5f96 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -110,6 +110,7 @@ Contents: tcp-thin team timestamping + tproxy .. only:: subproject and html diff --git a/Documentation/networking/tproxy.rst b/Documentation/networking/tproxy.rst new file mode 100644 index 000000000000..00dc3a1a66b4 --- /dev/null +++ b/Documentation/networking/tproxy.rst @@ -0,0 +1,109 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +Transparent proxy support +========================= + +This feature adds Linux 2.2-like transparent proxy support to current kernels. +To use it, enable the socket match and the TPROXY target in your kernel config. +You will need policy routing too, so be sure to enable that as well. + +From Linux 4.18 transparent proxy support is also available in nf_tables. + +1. Making non-local sockets work +================================ + +The idea is that you identify packets with destination address matching a local +socket on your box, set the packet mark to a certain value:: + + # iptables -t mangle -N DIVERT + # iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT + # iptables -t mangle -A DIVERT -j MARK --set-mark 1 + # iptables -t mangle -A DIVERT -j ACCEPT + +Alternatively you can do this in nft with the following commands:: + + # nft add table filter + # nft add chain filter divert "{ type filter hook prerouting priority -150; }" + # nft add rule filter divert meta l4proto tcp socket transparent 1 meta mark set 1 accept + +And then match on that value using policy routing to have those packets +delivered locally:: + + # ip rule add fwmark 1 lookup 100 + # ip route add local 0.0.0.0/0 dev lo table 100 + +Because of certain restrictions in the IPv4 routing output code you'll have to +modify your application to allow it to send datagrams _from_ non-local IP +addresses. All you have to do is enable the (SOL_IP, IP_TRANSPARENT) socket +option before calling bind:: + + fd = socket(AF_INET, SOCK_STREAM, 0); + /* - 8< -*/ + int value = 1; + setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value)); + /* - 8< -*/ + name.sin_family = AF_INET; + name.sin_port = htons(0xCAFE); + name.sin_addr.s_addr = htonl(0xDEADBEEF); + bind(fd, &name, sizeof(name)); + +A trivial patch for netcat is available here: +http://people.netfilter.org/hidden/tproxy/netcat-ip_transparent-support.patch + + +2. Redirecting traffic +====================== + +Transparent proxying often involves "intercepting" traffic on a router. This is +usually done with the iptables REDIRECT target; however, there are serious +limitations of that method. One of the major issues is that it actually +modifies the packets to change the destination address -- which might not be +acceptable in certain situations. (Think of proxying UDP for example: you won't +be able to find out the original destination address. Even in case of TCP +getting the original destination address is racy.) + +The 'TPROXY' target provides similar functionality without relying on NAT. Simply +add rules like this to the iptables ruleset above:: + + # iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \ + --tproxy-mark 0x1/0x1 --on-port 50080 + +Or the following rule to nft: + +# nft add rule filter divert tcp dport 80 tproxy to :50080 meta mark set 1 accept + +Note that for this to work you'll have to modify the proxy to enable (SOL_IP, +IP_TRANSPARENT) for the listening socket. + +As an example implementation, tcprdr is available here: +https://git.breakpoint.cc/cgit/fw/tcprdr.git/ +This tool is written by Florian Westphal and it was used for testing during the +nf_tables implementation. + +3. Iptables and nf_tables extensions +==================================== + +To use tproxy you'll need to have the following modules compiled for iptables: + + - NETFILTER_XT_MATCH_SOCKET + - NETFILTER_XT_TARGET_TPROXY + +Or the floowing modules for nf_tables: + + - NFT_SOCKET + - NFT_TPROXY + +4. Application support +====================== + +4.1. Squid +---------- + +Squid 3.HEAD has support built-in. To use it, pass +'--enable-linux-netfilter' to configure and set the 'tproxy' option on +the HTTP listener you redirect traffic to with the TPROXY iptables +target. + +For more information please consult the following page on the Squid +wiki: http://wiki.squid-cache.org/Features/Tproxy4 diff --git a/Documentation/networking/tproxy.txt b/Documentation/networking/tproxy.txt deleted file mode 100644 index b9a188823d9f..000000000000 --- a/Documentation/networking/tproxy.txt +++ /dev/null @@ -1,104 +0,0 @@ -Transparent proxy support -========================= - -This feature adds Linux 2.2-like transparent proxy support to current kernels. -To use it, enable the socket match and the TPROXY target in your kernel config. -You will need policy routing too, so be sure to enable that as well. - -From Linux 4.18 transparent proxy support is also available in nf_tables. - -1. Making non-local sockets work -================================ - -The idea is that you identify packets with destination address matching a local -socket on your box, set the packet mark to a certain value: - -# iptables -t mangle -N DIVERT -# iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT -# iptables -t mangle -A DIVERT -j MARK --set-mark 1 -# iptables -t mangle -A DIVERT -j ACCEPT - -Alternatively you can do this in nft with the following commands: - -# nft add table filter -# nft add chain filter divert "{ type filter hook prerouting priority -150; }" -# nft add rule filter divert meta l4proto tcp socket transparent 1 meta mark set 1 accept - -And then match on that value using policy routing to have those packets -delivered locally: - -# ip rule add fwmark 1 lookup 100 -# ip route add local 0.0.0.0/0 dev lo table 100 - -Because of certain restrictions in the IPv4 routing output code you'll have to -modify your application to allow it to send datagrams _from_ non-local IP -addresses. All you have to do is enable the (SOL_IP, IP_TRANSPARENT) socket -option before calling bind: - -fd = socket(AF_INET, SOCK_STREAM, 0); -/* - 8< -*/ -int value = 1; -setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value)); -/* - 8< -*/ -name.sin_family = AF_INET; -name.sin_port = htons(0xCAFE); -name.sin_addr.s_addr = htonl(0xDEADBEEF); -bind(fd, &name, sizeof(name)); - -A trivial patch for netcat is available here: -http://people.netfilter.org/hidden/tproxy/netcat-ip_transparent-support.patch - - -2. Redirecting traffic -====================== - -Transparent proxying often involves "intercepting" traffic on a router. This is -usually done with the iptables REDIRECT target; however, there are serious -limitations of that method. One of the major issues is that it actually -modifies the packets to change the destination address -- which might not be -acceptable in certain situations. (Think of proxying UDP for example: you won't -be able to find out the original destination address. Even in case of TCP -getting the original destination address is racy.) - -The 'TPROXY' target provides similar functionality without relying on NAT. Simply -add rules like this to the iptables ruleset above: - -# iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \ - --tproxy-mark 0x1/0x1 --on-port 50080 - -Or the following rule to nft: - -# nft add rule filter divert tcp dport 80 tproxy to :50080 meta mark set 1 accept - -Note that for this to work you'll have to modify the proxy to enable (SOL_IP, -IP_TRANSPARENT) for the listening socket. - -As an example implementation, tcprdr is available here: -https://git.breakpoint.cc/cgit/fw/tcprdr.git/ -This tool is written by Florian Westphal and it was used for testing during the -nf_tables implementation. - -3. Iptables and nf_tables extensions -==================================== - -To use tproxy you'll need to have the following modules compiled for iptables: - - NETFILTER_XT_MATCH_SOCKET - - NETFILTER_XT_TARGET_TPROXY - -Or the floowing modules for nf_tables: - - NFT_SOCKET - - NFT_TPROXY - -4. Application support -====================== - -4.1. Squid ----------- - -Squid 3.HEAD has support built-in. To use it, pass -'--enable-linux-netfilter' to configure and set the 'tproxy' option on -the HTTP listener you redirect traffic to with the TPROXY iptables -target. - -For more information please consult the following page on the Squid -wiki: http://wiki.squid-cache.org/Features/Tproxy4 diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index 468fea1aebba..3a3915d2e1ea 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -1043,7 +1043,7 @@ config NETFILTER_XT_TARGET_TPROXY on Netfilter connection tracking and NAT, unlike REDIRECT. For it to work you will have to configure certain iptables rules and use policy routing. For more information on how to set it up - see Documentation/networking/tproxy.txt. + see Documentation/networking/tproxy.rst. To compile it as a module, choose M here. If unsure, say N. -- cgit v1.2.3 From a70437cc09a11771870e9f6bfc0ba1237161daa8 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 30 Apr 2020 10:35:43 -0700 Subject: tcp: add hrtimer slack to sack compression Add a sysctl to control hrtimer slack, default of 100 usec. This gives the opportunity to reduce system overhead, and help very short RTT flows. Signed-off-by: Eric Dumazet Acked-by: Soheil Hassas Yeganeh Acked-by: Neal Cardwell Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.rst | 8 ++++++++ include/net/netns/ipv4.h | 1 + net/ipv4/sysctl_net_ipv4.c | 7 +++++++ net/ipv4/tcp_input.c | 5 +++-- net/ipv4/tcp_ipv4.c | 1 + 5 files changed, 20 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 3266aee9e052..50b440d29a13 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -651,6 +651,14 @@ tcp_comp_sack_delay_ns - LONG INTEGER Default : 1,000,000 ns (1 ms) +tcp_comp_sack_slack_ns - LONG INTEGER + This sysctl control the slack used when arming the + timer used by SACK compression. This gives extra time + for small RTT flows, and reduces system overhead by allowing + opportunistic reduction of timer interrupts. + + Default : 100,000 ns (100 us) + tcp_comp_sack_nr - INTEGER Max number of SACK that can be compressed. Using 0 disables SACK compression. diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index 5acdb4d414c4..9e36738c1fe1 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -173,6 +173,7 @@ struct netns_ipv4 { int sysctl_tcp_rmem[3]; int sysctl_tcp_comp_sack_nr; unsigned long sysctl_tcp_comp_sack_delay_ns; + unsigned long sysctl_tcp_comp_sack_slack_ns; struct inet_timewait_death_row tcp_death_row; int sysctl_max_syn_backlog; int sysctl_tcp_fastopen; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 95ad71e76cc3..3a628423d27b 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -1329,6 +1329,13 @@ static struct ctl_table ipv4_net_table[] = { .mode = 0644, .proc_handler = proc_doulongvec_minmax, }, + { + .procname = "tcp_comp_sack_slack_ns", + .data = &init_net.ipv4.sysctl_tcp_comp_sack_slack_ns, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = proc_doulongvec_minmax, + }, { .procname = "tcp_comp_sack_nr", .data = &init_net.ipv4.sysctl_tcp_comp_sack_nr, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index ef921ecba415..d68128a672ab 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5324,8 +5324,9 @@ send_now: delay = min_t(unsigned long, sock_net(sk)->ipv4.sysctl_tcp_comp_sack_delay_ns, rtt * (NSEC_PER_USEC >> 3)/20); sock_hold(sk); - hrtimer_start(&tp->compressed_ack_timer, ns_to_ktime(delay), - HRTIMER_MODE_REL_PINNED_SOFT); + hrtimer_start_range_ns(&tp->compressed_ack_timer, ns_to_ktime(delay), + sock_net(sk)->ipv4.sysctl_tcp_comp_sack_slack_ns, + HRTIMER_MODE_REL_PINNED_SOFT); } static inline void tcp_ack_snd_check(struct sock *sk) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 83a5d24e13b8..6c05f1ceb538 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2780,6 +2780,7 @@ static int __net_init tcp_sk_init(struct net *net) sizeof(init_net.ipv4.sysctl_tcp_wmem)); } net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC; + net->ipv4.sysctl_tcp_comp_sack_slack_ns = 100 * NSEC_PER_USEC; net->ipv4.sysctl_tcp_comp_sack_nr = 44; net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE; spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock); -- cgit v1.2.3 From 973d55e590beeca13fece60596ee3b511d36d9da Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:23 +0200 Subject: docs: networking: convert tuntap.txt to ReST - add SPDX header; - use copyright symbol; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/tuntap.rst | 259 ++++++++++++++++++++++++++++++++++++ Documentation/networking/tuntap.txt | 227 ------------------------------- MAINTAINERS | 2 +- drivers/net/Kconfig | 2 +- 5 files changed, 262 insertions(+), 229 deletions(-) create mode 100644 Documentation/networking/tuntap.rst delete mode 100644 Documentation/networking/tuntap.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b423b2db5f96..e7a683f0528d 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -111,6 +111,7 @@ Contents: team timestamping tproxy + tuntap .. only:: subproject and html diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst new file mode 100644 index 000000000000..a59d1dd6fdcc --- /dev/null +++ b/Documentation/networking/tuntap.rst @@ -0,0 +1,259 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +=============================== +Universal TUN/TAP device driver +=============================== + +Copyright |copy| 1999-2000 Maxim Krasnyansky + + Linux, Solaris drivers + Copyright |copy| 1999-2000 Maxim Krasnyansky + + FreeBSD TAP driver + Copyright |copy| 1999-2000 Maksim Yevmenkin + + Revision of this document 2002 by Florian Thiel + +1. Description +============== + + TUN/TAP provides packet reception and transmission for user space programs. + It can be seen as a simple Point-to-Point or Ethernet device, which, + instead of receiving packets from physical media, receives them from + user space program and instead of sending packets via physical media + writes them to the user space program. + + In order to use the driver a program has to open /dev/net/tun and issue a + corresponding ioctl() to register a network device with the kernel. A network + device will appear as tunXX or tapXX, depending on the options chosen. When + the program closes the file descriptor, the network device and all + corresponding routes will disappear. + + Depending on the type of device chosen the userspace program has to read/write + IP packets (with tun) or ethernet frames (with tap). Which one is being used + depends on the flags given with the ioctl(). + + The package from http://vtun.sourceforge.net/tun contains two simple examples + for how to use tun and tap devices. Both programs work like a bridge between + two network interfaces. + br_select.c - bridge based on select system call. + br_sigio.c - bridge based on async io and SIGIO signal. + However, the best example is VTun http://vtun.sourceforge.net :)) + +2. Configuration +================ + + Create device node:: + + mkdir /dev/net (if it doesn't exist already) + mknod /dev/net/tun c 10 200 + + Set permissions:: + + e.g. chmod 0666 /dev/net/tun + + There's no harm in allowing the device to be accessible by non-root users, + since CAP_NET_ADMIN is required for creating network devices or for + connecting to network devices which aren't owned by the user in question. + If you want to create persistent devices and give ownership of them to + unprivileged users, then you need the /dev/net/tun device to be usable by + those users. + + Driver module autoloading + + Make sure that "Kernel module loader" - module auto-loading + support is enabled in your kernel. The kernel should load it on + first access. + + Manual loading + + insert the module by hand:: + + modprobe tun + + If you do it the latter way, you have to load the module every time you + need it, if you do it the other way it will be automatically loaded when + /dev/net/tun is being opened. + +3. Program interface +==================== + +3.1 Network device allocation +----------------------------- + +``char *dev`` should be the name of the device with a format string (e.g. +"tun%d"), but (as far as I can see) this can be any valid network device name. +Note that the character pointer becomes overwritten with the real device name +(e.g. "tun0"):: + + #include + #include + + int tun_alloc(char *dev) + { + struct ifreq ifr; + int fd, err; + + if( (fd = open("/dev/net/tun", O_RDWR)) < 0 ) + return tun_alloc_old(dev); + + memset(&ifr, 0, sizeof(ifr)); + + /* Flags: IFF_TUN - TUN device (no Ethernet headers) + * IFF_TAP - TAP device + * + * IFF_NO_PI - Do not provide packet information + */ + ifr.ifr_flags = IFF_TUN; + if( *dev ) + strncpy(ifr.ifr_name, dev, IFNAMSIZ); + + if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){ + close(fd); + return err; + } + strcpy(dev, ifr.ifr_name); + return fd; + } + +3.2 Frame format +---------------- + +If flag IFF_NO_PI is not set each frame format is:: + + Flags [2 bytes] + Proto [2 bytes] + Raw protocol(IP, IPv6, etc) frame. + +3.3 Multiqueue tuntap interface +------------------------------- + +From version 3.8, Linux supports multiqueue tuntap which can uses multiple +file descriptors (queues) to parallelize packets sending or receiving. The +device allocation is the same as before, and if user wants to create multiple +queues, TUNSETIFF with the same device name must be called many times with +IFF_MULTI_QUEUE flag. + +``char *dev`` should be the name of the device, queues is the number of queues +to be created, fds is used to store and return the file descriptors (queues) +created to the caller. Each file descriptor were served as the interface of a +queue which could be accessed by userspace. + +:: + + #include + #include + + int tun_alloc_mq(char *dev, int queues, int *fds) + { + struct ifreq ifr; + int fd, err, i; + + if (!dev) + return -1; + + memset(&ifr, 0, sizeof(ifr)); + /* Flags: IFF_TUN - TUN device (no Ethernet headers) + * IFF_TAP - TAP device + * + * IFF_NO_PI - Do not provide packet information + * IFF_MULTI_QUEUE - Create a queue of multiqueue device + */ + ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE; + strcpy(ifr.ifr_name, dev); + + for (i = 0; i < queues; i++) { + if ((fd = open("/dev/net/tun", O_RDWR)) < 0) + goto err; + err = ioctl(fd, TUNSETIFF, (void *)&ifr); + if (err) { + close(fd); + goto err; + } + fds[i] = fd; + } + + return 0; + err: + for (--i; i >= 0; i--) + close(fds[i]); + return err; + } + +A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When +calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when +calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were +enabled by default after it was created through TUNSETIFF. + +fd is the file descriptor (queue) that we want to enable or disable, when +enable is true we enable it, otherwise we disable it:: + + #include + #include + + int tun_set_queue(int fd, int enable) + { + struct ifreq ifr; + + memset(&ifr, 0, sizeof(ifr)); + + if (enable) + ifr.ifr_flags = IFF_ATTACH_QUEUE; + else + ifr.ifr_flags = IFF_DETACH_QUEUE; + + return ioctl(fd, TUNSETQUEUE, (void *)&ifr); + } + +Universal TUN/TAP device driver Frequently Asked Question +========================================================= + +1. What platforms are supported by TUN/TAP driver ? + +Currently driver has been written for 3 Unices: + + - Linux kernels 2.2.x, 2.4.x + - FreeBSD 3.x, 4.x, 5.x + - Solaris 2.6, 7.0, 8.0 + +2. What is TUN/TAP driver used for? + +As mentioned above, main purpose of TUN/TAP driver is tunneling. +It is used by VTun (http://vtun.sourceforge.net). + +Another interesting application using TUN/TAP is pipsecd +(http://perso.enst.fr/~beyssac/pipsec/), a userspace IPSec +implementation that can use complete kernel routing (unlike FreeS/WAN). + +3. How does Virtual network device actually work ? + +Virtual network device can be viewed as a simple Point-to-Point or +Ethernet device, which instead of receiving packets from a physical +media, receives them from user space program and instead of sending +packets via physical media sends them to the user space program. + +Let's say that you configured IPv6 on the tap0, then whenever +the kernel sends an IPv6 packet to tap0, it is passed to the application +(VTun for example). The application encrypts, compresses and sends it to +the other side over TCP or UDP. The application on the other side decompresses +and decrypts the data received and writes the packet to the TAP device, +the kernel handles the packet like it came from real physical device. + +4. What is the difference between TUN driver and TAP driver? + +TUN works with IP frames. TAP works with Ethernet frames. + +This means that you have to read/write IP packets when you are using tun and +ethernet frames when using tap. + +5. What is the difference between BPF and TUN/TAP driver? + +BPF is an advanced packet filter. It can be attached to existing +network interface. It does not provide a virtual network interface. +A TUN/TAP driver does provide a virtual network interface and it is possible +to attach BPF to this interface. + +6. Does TAP driver support kernel Ethernet bridging? + +Yes. Linux and FreeBSD drivers support Ethernet bridging. diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt deleted file mode 100644 index 0104830d5075..000000000000 --- a/Documentation/networking/tuntap.txt +++ /dev/null @@ -1,227 +0,0 @@ -Universal TUN/TAP device driver. -Copyright (C) 1999-2000 Maxim Krasnyansky - - Linux, Solaris drivers - Copyright (C) 1999-2000 Maxim Krasnyansky - - FreeBSD TAP driver - Copyright (c) 1999-2000 Maksim Yevmenkin - - Revision of this document 2002 by Florian Thiel - -1. Description - TUN/TAP provides packet reception and transmission for user space programs. - It can be seen as a simple Point-to-Point or Ethernet device, which, - instead of receiving packets from physical media, receives them from - user space program and instead of sending packets via physical media - writes them to the user space program. - - In order to use the driver a program has to open /dev/net/tun and issue a - corresponding ioctl() to register a network device with the kernel. A network - device will appear as tunXX or tapXX, depending on the options chosen. When - the program closes the file descriptor, the network device and all - corresponding routes will disappear. - - Depending on the type of device chosen the userspace program has to read/write - IP packets (with tun) or ethernet frames (with tap). Which one is being used - depends on the flags given with the ioctl(). - - The package from http://vtun.sourceforge.net/tun contains two simple examples - for how to use tun and tap devices. Both programs work like a bridge between - two network interfaces. - br_select.c - bridge based on select system call. - br_sigio.c - bridge based on async io and SIGIO signal. - However, the best example is VTun http://vtun.sourceforge.net :)) - -2. Configuration - Create device node: - mkdir /dev/net (if it doesn't exist already) - mknod /dev/net/tun c 10 200 - - Set permissions: - e.g. chmod 0666 /dev/net/tun - There's no harm in allowing the device to be accessible by non-root users, - since CAP_NET_ADMIN is required for creating network devices or for - connecting to network devices which aren't owned by the user in question. - If you want to create persistent devices and give ownership of them to - unprivileged users, then you need the /dev/net/tun device to be usable by - those users. - - Driver module autoloading - - Make sure that "Kernel module loader" - module auto-loading - support is enabled in your kernel. The kernel should load it on - first access. - - Manual loading - insert the module by hand: - modprobe tun - - If you do it the latter way, you have to load the module every time you - need it, if you do it the other way it will be automatically loaded when - /dev/net/tun is being opened. - -3. Program interface - 3.1 Network device allocation: - - char *dev should be the name of the device with a format string (e.g. - "tun%d"), but (as far as I can see) this can be any valid network device name. - Note that the character pointer becomes overwritten with the real device name - (e.g. "tun0") - - #include - #include - - int tun_alloc(char *dev) - { - struct ifreq ifr; - int fd, err; - - if( (fd = open("/dev/net/tun", O_RDWR)) < 0 ) - return tun_alloc_old(dev); - - memset(&ifr, 0, sizeof(ifr)); - - /* Flags: IFF_TUN - TUN device (no Ethernet headers) - * IFF_TAP - TAP device - * - * IFF_NO_PI - Do not provide packet information - */ - ifr.ifr_flags = IFF_TUN; - if( *dev ) - strncpy(ifr.ifr_name, dev, IFNAMSIZ); - - if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){ - close(fd); - return err; - } - strcpy(dev, ifr.ifr_name); - return fd; - } - - 3.2 Frame format: - If flag IFF_NO_PI is not set each frame format is: - Flags [2 bytes] - Proto [2 bytes] - Raw protocol(IP, IPv6, etc) frame. - - 3.3 Multiqueue tuntap interface: - - From version 3.8, Linux supports multiqueue tuntap which can uses multiple - file descriptors (queues) to parallelize packets sending or receiving. The - device allocation is the same as before, and if user wants to create multiple - queues, TUNSETIFF with the same device name must be called many times with - IFF_MULTI_QUEUE flag. - - char *dev should be the name of the device, queues is the number of queues to - be created, fds is used to store and return the file descriptors (queues) - created to the caller. Each file descriptor were served as the interface of a - queue which could be accessed by userspace. - - #include - #include - - int tun_alloc_mq(char *dev, int queues, int *fds) - { - struct ifreq ifr; - int fd, err, i; - - if (!dev) - return -1; - - memset(&ifr, 0, sizeof(ifr)); - /* Flags: IFF_TUN - TUN device (no Ethernet headers) - * IFF_TAP - TAP device - * - * IFF_NO_PI - Do not provide packet information - * IFF_MULTI_QUEUE - Create a queue of multiqueue device - */ - ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE; - strcpy(ifr.ifr_name, dev); - - for (i = 0; i < queues; i++) { - if ((fd = open("/dev/net/tun", O_RDWR)) < 0) - goto err; - err = ioctl(fd, TUNSETIFF, (void *)&ifr); - if (err) { - close(fd); - goto err; - } - fds[i] = fd; - } - - return 0; - err: - for (--i; i >= 0; i--) - close(fds[i]); - return err; - } - - A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When - calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when - calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were - enabled by default after it was created through TUNSETIFF. - - fd is the file descriptor (queue) that we want to enable or disable, when - enable is true we enable it, otherwise we disable it - - #include - #include - - int tun_set_queue(int fd, int enable) - { - struct ifreq ifr; - - memset(&ifr, 0, sizeof(ifr)); - - if (enable) - ifr.ifr_flags = IFF_ATTACH_QUEUE; - else - ifr.ifr_flags = IFF_DETACH_QUEUE; - - return ioctl(fd, TUNSETQUEUE, (void *)&ifr); - } - -Universal TUN/TAP device driver Frequently Asked Question. - -1. What platforms are supported by TUN/TAP driver ? -Currently driver has been written for 3 Unices: - Linux kernels 2.2.x, 2.4.x - FreeBSD 3.x, 4.x, 5.x - Solaris 2.6, 7.0, 8.0 - -2. What is TUN/TAP driver used for? -As mentioned above, main purpose of TUN/TAP driver is tunneling. -It is used by VTun (http://vtun.sourceforge.net). - -Another interesting application using TUN/TAP is pipsecd -(http://perso.enst.fr/~beyssac/pipsec/), a userspace IPSec -implementation that can use complete kernel routing (unlike FreeS/WAN). - -3. How does Virtual network device actually work ? -Virtual network device can be viewed as a simple Point-to-Point or -Ethernet device, which instead of receiving packets from a physical -media, receives them from user space program and instead of sending -packets via physical media sends them to the user space program. - -Let's say that you configured IPv6 on the tap0, then whenever -the kernel sends an IPv6 packet to tap0, it is passed to the application -(VTun for example). The application encrypts, compresses and sends it to -the other side over TCP or UDP. The application on the other side decompresses -and decrypts the data received and writes the packet to the TAP device, -the kernel handles the packet like it came from real physical device. - -4. What is the difference between TUN driver and TAP driver? -TUN works with IP frames. TAP works with Ethernet frames. - -This means that you have to read/write IP packets when you are using tun and -ethernet frames when using tap. - -5. What is the difference between BPF and TUN/TAP driver? -BPF is an advanced packet filter. It can be attached to existing -network interface. It does not provide a virtual network interface. -A TUN/TAP driver does provide a virtual network interface and it is possible -to attach BPF to this interface. - -6. Does TAP driver support kernel Ethernet bridging? -Yes. Linux and FreeBSD drivers support Ethernet bridging. diff --git a/MAINTAINERS b/MAINTAINERS index 0ac9cec0bce6..6456c5bb02f1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -17161,7 +17161,7 @@ TUN/TAP driver M: Maxim Krasnyansky S: Maintained W: http://vtun.sourceforge.net/tun -F: Documentation/networking/tuntap.txt +F: Documentation/networking/tuntap.rst F: arch/um/os-Linux/drivers/ TURBOCHANNEL SUBSYSTEM diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index ad64be98330f..3f2c98a7906c 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -355,7 +355,7 @@ config TUN devices, driver will automatically delete tunXX or tapXX device and all routes corresponding to it. - Please read for more + Please read for more information. To compile this driver as a module, choose M here: the module -- cgit v1.2.3 From 961fb1ff412a2cefaf50f4f56bb60a10ed071df5 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:24 +0200 Subject: docs: networking: convert udplite.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark lists as such; - mark tables as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/udplite.rst | 291 +++++++++++++++++++++++++++++++++++ Documentation/networking/udplite.txt | 278 --------------------------------- 3 files changed, 292 insertions(+), 278 deletions(-) create mode 100644 Documentation/networking/udplite.rst delete mode 100644 Documentation/networking/udplite.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e7a683f0528d..ca0b0dbfd9ad 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -112,6 +112,7 @@ Contents: timestamping tproxy tuntap + udplite .. only:: subproject and html diff --git a/Documentation/networking/udplite.rst b/Documentation/networking/udplite.rst new file mode 100644 index 000000000000..2c225f28b7b2 --- /dev/null +++ b/Documentation/networking/udplite.rst @@ -0,0 +1,291 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ +The UDP-Lite protocol (RFC 3828) +================================ + + + UDP-Lite is a Standards-Track IETF transport protocol whose characteristic + is a variable-length checksum. This has advantages for transport of multimedia + (video, VoIP) over wireless networks, as partly damaged packets can still be + fed into the codec instead of being discarded due to a failed checksum test. + + This file briefly describes the existing kernel support and the socket API. + For in-depth information, you can consult: + + - The UDP-Lite Homepage: + http://web.archive.org/web/%2E/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/ + + From here you can also download some example application source code. + + - The UDP-Lite HOWTO on + http://web.archive.org/web/%2E/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/files/UDP-Lite-HOWTO.txt + + - The Wireshark UDP-Lite WiKi (with capture files): + https://wiki.wireshark.org/Lightweight_User_Datagram_Protocol + + - The Protocol Spec, RFC 3828, http://www.ietf.org/rfc/rfc3828.txt + + +1. Applications +=============== + + Several applications have been ported successfully to UDP-Lite. Ethereal + (now called wireshark) has UDP-Litev4/v6 support by default. + + Porting applications to UDP-Lite is straightforward: only socket level and + IPPROTO need to be changed; senders additionally set the checksum coverage + length (default = header length = 8). Details are in the next section. + +2. Programming API +================== + + UDP-Lite provides a connectionless, unreliable datagram service and hence + uses the same socket type as UDP. In fact, porting from UDP to UDP-Lite is + very easy: simply add ``IPPROTO_UDPLITE`` as the last argument of the + socket(2) call so that the statement looks like:: + + s = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDPLITE); + + or, respectively, + + :: + + s = socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDPLITE); + + With just the above change you are able to run UDP-Lite services or connect + to UDP-Lite servers. The kernel will assume that you are not interested in + using partial checksum coverage and so emulate UDP mode (full coverage). + + To make use of the partial checksum coverage facilities requires setting a + single socket option, which takes an integer specifying the coverage length: + + * Sender checksum coverage: UDPLITE_SEND_CSCOV + + For example:: + + int val = 20; + setsockopt(s, SOL_UDPLITE, UDPLITE_SEND_CSCOV, &val, sizeof(int)); + + sets the checksum coverage length to 20 bytes (12b data + 8b header). + Of each packet only the first 20 bytes (plus the pseudo-header) will be + checksummed. This is useful for RTP applications which have a 12-byte + base header. + + + * Receiver checksum coverage: UDPLITE_RECV_CSCOV + + This option is the receiver-side analogue. It is truly optional, i.e. not + required to enable traffic with partial checksum coverage. Its function is + that of a traffic filter: when enabled, it instructs the kernel to drop + all packets which have a coverage _less_ than this value. For example, if + RTP and UDP headers are to be protected, a receiver can enforce that only + packets with a minimum coverage of 20 are admitted:: + + int min = 20; + setsockopt(s, SOL_UDPLITE, UDPLITE_RECV_CSCOV, &min, sizeof(int)); + + The calls to getsockopt(2) are analogous. Being an extension and not a stand- + alone protocol, all socket options known from UDP can be used in exactly the + same manner as before, e.g. UDP_CORK or UDP_ENCAP. + + A detailed discussion of UDP-Lite checksum coverage options is in section IV. + +3. Header Files +=============== + + The socket API requires support through header files in /usr/include: + + * /usr/include/netinet/in.h + to define IPPROTO_UDPLITE + + * /usr/include/netinet/udplite.h + for UDP-Lite header fields and protocol constants + + For testing purposes, the following can serve as a ``mini`` header file:: + + #define IPPROTO_UDPLITE 136 + #define SOL_UDPLITE 136 + #define UDPLITE_SEND_CSCOV 10 + #define UDPLITE_RECV_CSCOV 11 + + Ready-made header files for various distros are in the UDP-Lite tarball. + +4. Kernel Behaviour with Regards to the Various Socket Options +============================================================== + + + To enable debugging messages, the log level need to be set to 8, as most + messages use the KERN_DEBUG level (7). + + 1) Sender Socket Options + + If the sender specifies a value of 0 as coverage length, the module + assumes full coverage, transmits a packet with coverage length of 0 + and according checksum. If the sender specifies a coverage < 8 and + different from 0, the kernel assumes 8 as default value. Finally, + if the specified coverage length exceeds the packet length, the packet + length is used instead as coverage length. + + 2) Receiver Socket Options + + The receiver specifies the minimum value of the coverage length it + is willing to accept. A value of 0 here indicates that the receiver + always wants the whole of the packet covered. In this case, all + partially covered packets are dropped and an error is logged. + + It is not possible to specify illegal values (<0 and <8); in these + cases the default of 8 is assumed. + + All packets arriving with a coverage value less than the specified + threshold are discarded, these events are also logged. + + 3) Disabling the Checksum Computation + + On both sender and receiver, checksumming will always be performed + and cannot be disabled using SO_NO_CHECK. Thus:: + + setsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, ... ); + + will always will be ignored, while the value of:: + + getsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, &value, ...); + + is meaningless (as in TCP). Packets with a zero checksum field are + illegal (cf. RFC 3828, sec. 3.1) and will be silently discarded. + + 4) Fragmentation + + The checksum computation respects both buffersize and MTU. The size + of UDP-Lite packets is determined by the size of the send buffer. The + minimum size of the send buffer is 2048 (defined as SOCK_MIN_SNDBUF + in include/net/sock.h), the default value is configurable as + net.core.wmem_default or via setting the SO_SNDBUF socket(7) + option. The maximum upper bound for the send buffer is determined + by net.core.wmem_max. + + Given a payload size larger than the send buffer size, UDP-Lite will + split the payload into several individual packets, filling up the + send buffer size in each case. + + The precise value also depends on the interface MTU. The interface MTU, + in turn, may trigger IP fragmentation. In this case, the generated + UDP-Lite packet is split into several IP packets, of which only the + first one contains the L4 header. + + The send buffer size has implications on the checksum coverage length. + Consider the following example:: + + Payload: 1536 bytes Send Buffer: 1024 bytes + MTU: 1500 bytes Coverage Length: 856 bytes + + UDP-Lite will ship the 1536 bytes in two separate packets:: + + Packet 1: 1024 payload + 8 byte header + 20 byte IP header = 1052 bytes + Packet 2: 512 payload + 8 byte header + 20 byte IP header = 540 bytes + + The coverage packet covers the UDP-Lite header and 848 bytes of the + payload in the first packet, the second packet is fully covered. Note + that for the second packet, the coverage length exceeds the packet + length. The kernel always re-adjusts the coverage length to the packet + length in such cases. + + As an example of what happens when one UDP-Lite packet is split into + several tiny fragments, consider the following example:: + + Payload: 1024 bytes Send buffer size: 1024 bytes + MTU: 300 bytes Coverage length: 575 bytes + + +-+-----------+--------------+--------------+--------------+ + |8| 272 | 280 | 280 | 280 | + +-+-----------+--------------+--------------+--------------+ + 280 560 840 1032 + ^ + *****checksum coverage************* + + The UDP-Lite module generates one 1032 byte packet (1024 + 8 byte + header). According to the interface MTU, these are split into 4 IP + packets (280 byte IP payload + 20 byte IP header). The kernel module + sums the contents of the entire first two packets, plus 15 bytes of + the last packet before releasing the fragments to the IP module. + + To see the analogous case for IPv6 fragmentation, consider a link + MTU of 1280 bytes and a write buffer of 3356 bytes. If the checksum + coverage is less than 1232 bytes (MTU minus IPv6/fragment header + lengths), only the first fragment needs to be considered. When using + larger checksum coverage lengths, each eligible fragment needs to be + checksummed. Suppose we have a checksum coverage of 3062. The buffer + of 3356 bytes will be split into the following fragments:: + + Fragment 1: 1280 bytes carrying 1232 bytes of UDP-Lite data + Fragment 2: 1280 bytes carrying 1232 bytes of UDP-Lite data + Fragment 3: 948 bytes carrying 900 bytes of UDP-Lite data + + The first two fragments have to be checksummed in full, of the last + fragment only 598 (= 3062 - 2*1232) bytes are checksummed. + + While it is important that such cases are dealt with correctly, they + are (annoyingly) rare: UDP-Lite is designed for optimising multimedia + performance over wireless (or generally noisy) links and thus smaller + coverage lengths are likely to be expected. + +5. UDP-Lite Runtime Statistics and their Meaning +================================================ + + Exceptional and error conditions are logged to syslog at the KERN_DEBUG + level. Live statistics about UDP-Lite are available in /proc/net/snmp + and can (with newer versions of netstat) be viewed using:: + + netstat -svu + + This displays UDP-Lite statistics variables, whose meaning is as follows. + + ============ ===================================================== + InDatagrams The total number of datagrams delivered to users. + + NoPorts Number of packets received to an unknown port. + These cases are counted separately (not as InErrors). + + InErrors Number of erroneous UDP-Lite packets. Errors include: + + * internal socket queue receive errors + * packet too short (less than 8 bytes or stated + coverage length exceeds received length) + * xfrm4_policy_check() returned with error + * application has specified larger min. coverage + length than that of incoming packet + * checksum coverage violated + * bad checksum + + OutDatagrams Total number of sent datagrams. + ============ ===================================================== + + These statistics derive from the UDP MIB (RFC 2013). + +6. IPtables +=========== + + There is packet match support for UDP-Lite as well as support for the LOG target. + If you copy and paste the following line into /etc/protocols:: + + udplite 136 UDP-Lite # UDP-Lite [RFC 3828] + + then:: + + iptables -A INPUT -p udplite -j LOG + + will produce logging output to syslog. Dropping and rejecting packets also works. + +7. Maintainer Address +===================== + + The UDP-Lite patch was developed at + + University of Aberdeen + Electronics Research Group + Department of Engineering + Fraser Noble Building + Aberdeen AB24 3UE; UK + + The current maintainer is Gerrit Renker, . Initial + code was developed by William Stanislaus, . diff --git a/Documentation/networking/udplite.txt b/Documentation/networking/udplite.txt deleted file mode 100644 index 53a726855e49..000000000000 --- a/Documentation/networking/udplite.txt +++ /dev/null @@ -1,278 +0,0 @@ - =========================================================================== - The UDP-Lite protocol (RFC 3828) - =========================================================================== - - - UDP-Lite is a Standards-Track IETF transport protocol whose characteristic - is a variable-length checksum. This has advantages for transport of multimedia - (video, VoIP) over wireless networks, as partly damaged packets can still be - fed into the codec instead of being discarded due to a failed checksum test. - - This file briefly describes the existing kernel support and the socket API. - For in-depth information, you can consult: - - o The UDP-Lite Homepage: - http://web.archive.org/web/*/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/ - From here you can also download some example application source code. - - o The UDP-Lite HOWTO on - http://web.archive.org/web/*/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/ - files/UDP-Lite-HOWTO.txt - - o The Wireshark UDP-Lite WiKi (with capture files): - https://wiki.wireshark.org/Lightweight_User_Datagram_Protocol - - o The Protocol Spec, RFC 3828, http://www.ietf.org/rfc/rfc3828.txt - - - I) APPLICATIONS - - Several applications have been ported successfully to UDP-Lite. Ethereal - (now called wireshark) has UDP-Litev4/v6 support by default. - Porting applications to UDP-Lite is straightforward: only socket level and - IPPROTO need to be changed; senders additionally set the checksum coverage - length (default = header length = 8). Details are in the next section. - - - II) PROGRAMMING API - - UDP-Lite provides a connectionless, unreliable datagram service and hence - uses the same socket type as UDP. In fact, porting from UDP to UDP-Lite is - very easy: simply add `IPPROTO_UDPLITE' as the last argument of the socket(2) - call so that the statement looks like: - - s = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDPLITE); - - or, respectively, - - s = socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDPLITE); - - With just the above change you are able to run UDP-Lite services or connect - to UDP-Lite servers. The kernel will assume that you are not interested in - using partial checksum coverage and so emulate UDP mode (full coverage). - - To make use of the partial checksum coverage facilities requires setting a - single socket option, which takes an integer specifying the coverage length: - - * Sender checksum coverage: UDPLITE_SEND_CSCOV - - For example, - - int val = 20; - setsockopt(s, SOL_UDPLITE, UDPLITE_SEND_CSCOV, &val, sizeof(int)); - - sets the checksum coverage length to 20 bytes (12b data + 8b header). - Of each packet only the first 20 bytes (plus the pseudo-header) will be - checksummed. This is useful for RTP applications which have a 12-byte - base header. - - - * Receiver checksum coverage: UDPLITE_RECV_CSCOV - - This option is the receiver-side analogue. It is truly optional, i.e. not - required to enable traffic with partial checksum coverage. Its function is - that of a traffic filter: when enabled, it instructs the kernel to drop - all packets which have a coverage _less_ than this value. For example, if - RTP and UDP headers are to be protected, a receiver can enforce that only - packets with a minimum coverage of 20 are admitted: - - int min = 20; - setsockopt(s, SOL_UDPLITE, UDPLITE_RECV_CSCOV, &min, sizeof(int)); - - The calls to getsockopt(2) are analogous. Being an extension and not a stand- - alone protocol, all socket options known from UDP can be used in exactly the - same manner as before, e.g. UDP_CORK or UDP_ENCAP. - - A detailed discussion of UDP-Lite checksum coverage options is in section IV. - - - III) HEADER FILES - - The socket API requires support through header files in /usr/include: - - * /usr/include/netinet/in.h - to define IPPROTO_UDPLITE - - * /usr/include/netinet/udplite.h - for UDP-Lite header fields and protocol constants - - For testing purposes, the following can serve as a `mini' header file: - - #define IPPROTO_UDPLITE 136 - #define SOL_UDPLITE 136 - #define UDPLITE_SEND_CSCOV 10 - #define UDPLITE_RECV_CSCOV 11 - - Ready-made header files for various distros are in the UDP-Lite tarball. - - - IV) KERNEL BEHAVIOUR WITH REGARD TO THE VARIOUS SOCKET OPTIONS - - To enable debugging messages, the log level need to be set to 8, as most - messages use the KERN_DEBUG level (7). - - 1) Sender Socket Options - - If the sender specifies a value of 0 as coverage length, the module - assumes full coverage, transmits a packet with coverage length of 0 - and according checksum. If the sender specifies a coverage < 8 and - different from 0, the kernel assumes 8 as default value. Finally, - if the specified coverage length exceeds the packet length, the packet - length is used instead as coverage length. - - 2) Receiver Socket Options - - The receiver specifies the minimum value of the coverage length it - is willing to accept. A value of 0 here indicates that the receiver - always wants the whole of the packet covered. In this case, all - partially covered packets are dropped and an error is logged. - - It is not possible to specify illegal values (<0 and <8); in these - cases the default of 8 is assumed. - - All packets arriving with a coverage value less than the specified - threshold are discarded, these events are also logged. - - 3) Disabling the Checksum Computation - - On both sender and receiver, checksumming will always be performed - and cannot be disabled using SO_NO_CHECK. Thus - - setsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, ... ); - - will always will be ignored, while the value of - - getsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, &value, ...); - - is meaningless (as in TCP). Packets with a zero checksum field are - illegal (cf. RFC 3828, sec. 3.1) and will be silently discarded. - - 4) Fragmentation - - The checksum computation respects both buffersize and MTU. The size - of UDP-Lite packets is determined by the size of the send buffer. The - minimum size of the send buffer is 2048 (defined as SOCK_MIN_SNDBUF - in include/net/sock.h), the default value is configurable as - net.core.wmem_default or via setting the SO_SNDBUF socket(7) - option. The maximum upper bound for the send buffer is determined - by net.core.wmem_max. - - Given a payload size larger than the send buffer size, UDP-Lite will - split the payload into several individual packets, filling up the - send buffer size in each case. - - The precise value also depends on the interface MTU. The interface MTU, - in turn, may trigger IP fragmentation. In this case, the generated - UDP-Lite packet is split into several IP packets, of which only the - first one contains the L4 header. - - The send buffer size has implications on the checksum coverage length. - Consider the following example: - - Payload: 1536 bytes Send Buffer: 1024 bytes - MTU: 1500 bytes Coverage Length: 856 bytes - - UDP-Lite will ship the 1536 bytes in two separate packets: - - Packet 1: 1024 payload + 8 byte header + 20 byte IP header = 1052 bytes - Packet 2: 512 payload + 8 byte header + 20 byte IP header = 540 bytes - - The coverage packet covers the UDP-Lite header and 848 bytes of the - payload in the first packet, the second packet is fully covered. Note - that for the second packet, the coverage length exceeds the packet - length. The kernel always re-adjusts the coverage length to the packet - length in such cases. - - As an example of what happens when one UDP-Lite packet is split into - several tiny fragments, consider the following example. - - Payload: 1024 bytes Send buffer size: 1024 bytes - MTU: 300 bytes Coverage length: 575 bytes - - +-+-----------+--------------+--------------+--------------+ - |8| 272 | 280 | 280 | 280 | - +-+-----------+--------------+--------------+--------------+ - 280 560 840 1032 - ^ - *****checksum coverage************* - - The UDP-Lite module generates one 1032 byte packet (1024 + 8 byte - header). According to the interface MTU, these are split into 4 IP - packets (280 byte IP payload + 20 byte IP header). The kernel module - sums the contents of the entire first two packets, plus 15 bytes of - the last packet before releasing the fragments to the IP module. - - To see the analogous case for IPv6 fragmentation, consider a link - MTU of 1280 bytes and a write buffer of 3356 bytes. If the checksum - coverage is less than 1232 bytes (MTU minus IPv6/fragment header - lengths), only the first fragment needs to be considered. When using - larger checksum coverage lengths, each eligible fragment needs to be - checksummed. Suppose we have a checksum coverage of 3062. The buffer - of 3356 bytes will be split into the following fragments: - - Fragment 1: 1280 bytes carrying 1232 bytes of UDP-Lite data - Fragment 2: 1280 bytes carrying 1232 bytes of UDP-Lite data - Fragment 3: 948 bytes carrying 900 bytes of UDP-Lite data - - The first two fragments have to be checksummed in full, of the last - fragment only 598 (= 3062 - 2*1232) bytes are checksummed. - - While it is important that such cases are dealt with correctly, they - are (annoyingly) rare: UDP-Lite is designed for optimising multimedia - performance over wireless (or generally noisy) links and thus smaller - coverage lengths are likely to be expected. - - - V) UDP-LITE RUNTIME STATISTICS AND THEIR MEANING - - Exceptional and error conditions are logged to syslog at the KERN_DEBUG - level. Live statistics about UDP-Lite are available in /proc/net/snmp - and can (with newer versions of netstat) be viewed using - - netstat -svu - - This displays UDP-Lite statistics variables, whose meaning is as follows. - - InDatagrams: The total number of datagrams delivered to users. - - NoPorts: Number of packets received to an unknown port. - These cases are counted separately (not as InErrors). - - InErrors: Number of erroneous UDP-Lite packets. Errors include: - * internal socket queue receive errors - * packet too short (less than 8 bytes or stated - coverage length exceeds received length) - * xfrm4_policy_check() returned with error - * application has specified larger min. coverage - length than that of incoming packet - * checksum coverage violated - * bad checksum - - OutDatagrams: Total number of sent datagrams. - - These statistics derive from the UDP MIB (RFC 2013). - - - VI) IPTABLES - - There is packet match support for UDP-Lite as well as support for the LOG target. - If you copy and paste the following line into /etc/protocols, - - udplite 136 UDP-Lite # UDP-Lite [RFC 3828] - - then - iptables -A INPUT -p udplite -j LOG - - will produce logging output to syslog. Dropping and rejecting packets also works. - - - VII) MAINTAINER ADDRESS - - The UDP-Lite patch was developed at - University of Aberdeen - Electronics Research Group - Department of Engineering - Fraser Noble Building - Aberdeen AB24 3UE; UK - The current maintainer is Gerrit Renker, . Initial - code was developed by William Stanislaus, . -- cgit v1.2.3 From 58ccb2b2e87d52ec0b4cbd40b94e0b63e90af873 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:25 +0200 Subject: docs: networking: convert vrf.txt to ReST - add SPDX header; - adjust title markup; - Add a subtitle for the first section; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Acked-by: David Ahern Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/vrf.rst | 451 +++++++++++++++++++++++++++++++++++++ Documentation/networking/vrf.txt | 418 ---------------------------------- MAINTAINERS | 2 +- 4 files changed, 453 insertions(+), 419 deletions(-) create mode 100644 Documentation/networking/vrf.rst delete mode 100644 Documentation/networking/vrf.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index ca0b0dbfd9ad..2227b9f4509d 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -113,6 +113,7 @@ Contents: tproxy tuntap udplite + vrf .. only:: subproject and html diff --git a/Documentation/networking/vrf.rst b/Documentation/networking/vrf.rst new file mode 100644 index 000000000000..0dde145043bc --- /dev/null +++ b/Documentation/networking/vrf.rst @@ -0,0 +1,451 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================== +Virtual Routing and Forwarding (VRF) +==================================== + +The VRF Device +============== + +The VRF device combined with ip rules provides the ability to create virtual +routing and forwarding domains (aka VRFs, VRF-lite to be specific) in the +Linux network stack. One use case is the multi-tenancy problem where each +tenant has their own unique routing tables and in the very least need +different default gateways. + +Processes can be "VRF aware" by binding a socket to the VRF device. Packets +through the socket then use the routing table associated with the VRF +device. An important feature of the VRF device implementation is that it +impacts only Layer 3 and above so L2 tools (e.g., LLDP) are not affected +(ie., they do not need to be run in each VRF). The design also allows +the use of higher priority ip rules (Policy Based Routing, PBR) to take +precedence over the VRF device rules directing specific traffic as desired. + +In addition, VRF devices allow VRFs to be nested within namespaces. For +example network namespaces provide separation of network interfaces at the +device layer, VLANs on the interfaces within a namespace provide L2 separation +and then VRF devices provide L3 separation. + +Design +------ +A VRF device is created with an associated route table. Network interfaces +are then enslaved to a VRF device:: + + +-----------------------------+ + | vrf-blue | ===> route table 10 + +-----------------------------+ + | | | + +------+ +------+ +-------------+ + | eth1 | | eth2 | ... | bond1 | + +------+ +------+ +-------------+ + | | + +------+ +------+ + | eth8 | | eth9 | + +------+ +------+ + +Packets received on an enslaved device and are switched to the VRF device +in the IPv4 and IPv6 processing stacks giving the impression that packets +flow through the VRF device. Similarly on egress routing rules are used to +send packets to the VRF device driver before getting sent out the actual +interface. This allows tcpdump on a VRF device to capture all packets into +and out of the VRF as a whole\ [1]_. Similarly, netfilter\ [2]_ and tc rules +can be applied using the VRF device to specify rules that apply to the VRF +domain as a whole. + +.. [1] Packets in the forwarded state do not flow through the device, so those + packets are not seen by tcpdump. Will revisit this limitation in a + future release. + +.. [2] Iptables on ingress supports PREROUTING with skb->dev set to the real + ingress device and both INPUT and PREROUTING rules with skb->dev set to + the VRF device. For egress POSTROUTING and OUTPUT rules can be written + using either the VRF device or real egress device. + +Setup +----- +1. VRF device is created with an association to a FIB table. + e.g,:: + + ip link add vrf-blue type vrf table 10 + ip link set dev vrf-blue up + +2. An l3mdev FIB rule directs lookups to the table associated with the device. + A single l3mdev rule is sufficient for all VRFs. The VRF device adds the + l3mdev rule for IPv4 and IPv6 when the first device is created with a + default preference of 1000. Users may delete the rule if desired and add + with a different priority or install per-VRF rules. + + Prior to the v4.8 kernel iif and oif rules are needed for each VRF device:: + + ip ru add oif vrf-blue table 10 + ip ru add iif vrf-blue table 10 + +3. Set the default route for the table (and hence default route for the VRF):: + + ip route add table 10 unreachable default metric 4278198272 + + This high metric value ensures that the default unreachable route can + be overridden by a routing protocol suite. FRRouting interprets + kernel metrics as a combined admin distance (upper byte) and priority + (lower 3 bytes). Thus the above metric translates to [255/8192]. + +4. Enslave L3 interfaces to a VRF device:: + + ip link set dev eth1 master vrf-blue + + Local and connected routes for enslaved devices are automatically moved to + the table associated with VRF device. Any additional routes depending on + the enslaved device are dropped and will need to be reinserted to the VRF + FIB table following the enslavement. + + The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global + addresses as VRF enslavement changes:: + + sysctl -w net.ipv6.conf.all.keep_addr_on_down=1 + +5. Additional VRF routes are added to associated table:: + + ip route add table 10 ... + + +Applications +------------ +Applications that are to work within a VRF need to bind their socket to the +VRF device:: + + setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); + +or to specify the output device using cmsg and IP_PKTINFO. + +By default the scope of the port bindings for unbound sockets is +limited to the default VRF. That is, it will not be matched by packets +arriving on interfaces enslaved to an l3mdev and processes may bind to +the same port if they bind to an l3mdev. + +TCP & UDP services running in the default VRF context (ie., not bound +to any VRF device) can work across all VRF domains by enabling the +tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:: + + sysctl -w net.ipv4.tcp_l3mdev_accept=1 + sysctl -w net.ipv4.udp_l3mdev_accept=1 + +These options are disabled by default so that a socket in a VRF is only +selected for packets in that VRF. There is a similar option for RAW +sockets, which is enabled by default for reasons of backwards compatibility. +This is so as to specify the output device with cmsg and IP_PKTINFO, but +using a socket not bound to the corresponding VRF. This allows e.g. older ping +implementations to be run with specifying the device but without executing it +in the VRF. This option can be disabled so that packets received in a VRF +context are only handled by a raw socket bound to the VRF, and packets in the +default VRF are only handled by a socket not bound to any VRF:: + + sysctl -w net.ipv4.raw_l3mdev_accept=0 + +netfilter rules on the VRF device can be used to limit access to services +running in the default VRF context as well. + +-------------------------------------------------------------------------------- + +Using iproute2 for VRFs +======================= +iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this +section lists both commands where appropriate -- with the vrf keyword and the +older form without it. + +1. Create a VRF + + To instantiate a VRF device and associate it with a table:: + + $ ip link add dev NAME type vrf table ID + + As of v4.8 the kernel supports the l3mdev FIB rule where a single rule + covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 on first + device create. + +2. List VRFs + + To list VRFs that have been created:: + + $ ip [-d] link show type vrf + NOTE: The -d option is needed to show the table id + + For example:: + + $ ip -d link show type vrf + 11: mgmt: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 1 addrgenmode eui64 + 12: red: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 10 addrgenmode eui64 + 13: blue: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 66 addrgenmode eui64 + 14: green: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 81 addrgenmode eui64 + + + Or in brief output:: + + $ ip -br link show type vrf + mgmt UP 72:b3:ba:91:e2:24 + red UP b6:6f:6e:f6:da:73 + blue UP 36:62:e8:7d:bb:8c + green UP e6:28:b8:63:70:bb + + +3. Assign a Network Interface to a VRF + + Network interfaces are assigned to a VRF by enslaving the netdevice to a + VRF device:: + + $ ip link set dev NAME master NAME + + On enslavement connected and local routes are automatically moved to the + table associated with the VRF device. + + For example:: + + $ ip link set dev eth0 master mgmt + + +4. Show Devices Assigned to a VRF + + To show devices that have been assigned to a specific VRF add the master + option to the ip command:: + + $ ip link show vrf NAME + $ ip link show master NAME + + For example:: + + $ ip link show vrf red + 3: eth1: mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 + link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff + 4: eth2: mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 + link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff + 7: eth5: mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000 + link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff + + + Or using the brief output:: + + $ ip -br link show vrf red + eth1 UP 02:00:00:00:02:02 + eth2 UP 02:00:00:00:02:03 + eth5 DOWN 02:00:00:00:02:06 + + +5. Show Neighbor Entries for a VRF + + To list neighbor entries associated with devices enslaved to a VRF device + add the master option to the ip command:: + + $ ip [-6] neigh show vrf NAME + $ ip [-6] neigh show master NAME + + For example:: + + $ ip neigh show vrf red + 10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE + 10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE + + $ ip -6 neigh show vrf red + 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE + + +6. Show Addresses for a VRF + + To show addresses for interfaces associated with a VRF add the master + option to the ip command:: + + $ ip addr show vrf NAME + $ ip addr show master NAME + + For example:: + + $ ip addr show vrf red + 3: eth1: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 + link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff + inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1 + valid_lft forever preferred_lft forever + inet6 2002:1::2/120 scope global + valid_lft forever preferred_lft forever + inet6 fe80::ff:fe00:202/64 scope link + valid_lft forever preferred_lft forever + 4: eth2: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 + link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff + inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2 + valid_lft forever preferred_lft forever + inet6 2002:2::2/120 scope global + valid_lft forever preferred_lft forever + inet6 fe80::ff:fe00:203/64 scope link + valid_lft forever preferred_lft forever + 7: eth5: mtu 1500 qdisc noop master red state DOWN group default qlen 1000 + link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff + + Or in brief format:: + + $ ip -br addr show vrf red + eth1 UP 10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64 + eth2 UP 10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64 + eth5 DOWN + + +7. Show Routes for a VRF + + To show routes for a VRF use the ip command to display the table associated + with the VRF device:: + + $ ip [-6] route show vrf NAME + $ ip [-6] route show table ID + + For example:: + + $ ip route show vrf red + unreachable default metric 4278198272 + broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2 + 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2 + local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2 + broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2 + broadcast 10.2.2.0 dev eth2 proto kernel scope link src 10.2.2.2 + 10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2 + local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2 + broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2 + + $ ip -6 route show vrf red + local 2002:1:: dev lo proto none metric 0 pref medium + local 2002:1::2 dev lo proto none metric 0 pref medium + 2002:1::/120 dev eth1 proto kernel metric 256 pref medium + local 2002:2:: dev lo proto none metric 0 pref medium + local 2002:2::2 dev lo proto none metric 0 pref medium + 2002:2::/120 dev eth2 proto kernel metric 256 pref medium + local fe80:: dev lo proto none metric 0 pref medium + local fe80:: dev lo proto none metric 0 pref medium + local fe80::ff:fe00:202 dev lo proto none metric 0 pref medium + local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium + fe80::/64 dev eth1 proto kernel metric 256 pref medium + fe80::/64 dev eth2 proto kernel metric 256 pref medium + ff00::/8 dev red metric 256 pref medium + ff00::/8 dev eth1 metric 256 pref medium + ff00::/8 dev eth2 metric 256 pref medium + unreachable default dev lo metric 4278198272 error -101 pref medium + +8. Route Lookup for a VRF + + A test route lookup can be done for a VRF:: + + $ ip [-6] route get vrf NAME ADDRESS + $ ip [-6] route get oif NAME ADDRESS + + For example:: + + $ ip route get 10.2.1.40 vrf red + 10.2.1.40 dev eth1 table red src 10.2.1.2 + cache + + $ ip -6 route get 2002:1::32 vrf red + 2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium + + +9. Removing Network Interface from a VRF + + Network interfaces are removed from a VRF by breaking the enslavement to + the VRF device:: + + $ ip link set dev NAME nomaster + + Connected routes are moved back to the default table and local entries are + moved to the local table. + + For example:: + + $ ip link set dev eth0 nomaster + +-------------------------------------------------------------------------------- + +Commands used in this example:: + + cat >> /etc/iproute2/rt_tables.d/vrf.conf < route table 10 - +-----------------------------+ - | | | - +------+ +------+ +-------------+ - | eth1 | | eth2 | ... | bond1 | - +------+ +------+ +-------------+ - | | - +------+ +------+ - | eth8 | | eth9 | - +------+ +------+ - -Packets received on an enslaved device and are switched to the VRF device -in the IPv4 and IPv6 processing stacks giving the impression that packets -flow through the VRF device. Similarly on egress routing rules are used to -send packets to the VRF device driver before getting sent out the actual -interface. This allows tcpdump on a VRF device to capture all packets into -and out of the VRF as a whole.[1] Similarly, netfilter[2] and tc rules can be -applied using the VRF device to specify rules that apply to the VRF domain -as a whole. - -[1] Packets in the forwarded state do not flow through the device, so those - packets are not seen by tcpdump. Will revisit this limitation in a - future release. - -[2] Iptables on ingress supports PREROUTING with skb->dev set to the real - ingress device and both INPUT and PREROUTING rules with skb->dev set to - the VRF device. For egress POSTROUTING and OUTPUT rules can be written - using either the VRF device or real egress device. - -Setup ------ -1. VRF device is created with an association to a FIB table. - e.g, ip link add vrf-blue type vrf table 10 - ip link set dev vrf-blue up - -2. An l3mdev FIB rule directs lookups to the table associated with the device. - A single l3mdev rule is sufficient for all VRFs. The VRF device adds the - l3mdev rule for IPv4 and IPv6 when the first device is created with a - default preference of 1000. Users may delete the rule if desired and add - with a different priority or install per-VRF rules. - - Prior to the v4.8 kernel iif and oif rules are needed for each VRF device: - ip ru add oif vrf-blue table 10 - ip ru add iif vrf-blue table 10 - -3. Set the default route for the table (and hence default route for the VRF). - ip route add table 10 unreachable default metric 4278198272 - - This high metric value ensures that the default unreachable route can - be overridden by a routing protocol suite. FRRouting interprets - kernel metrics as a combined admin distance (upper byte) and priority - (lower 3 bytes). Thus the above metric translates to [255/8192]. - -4. Enslave L3 interfaces to a VRF device. - ip link set dev eth1 master vrf-blue - - Local and connected routes for enslaved devices are automatically moved to - the table associated with VRF device. Any additional routes depending on - the enslaved device are dropped and will need to be reinserted to the VRF - FIB table following the enslavement. - - The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global - addresses as VRF enslavement changes. - sysctl -w net.ipv6.conf.all.keep_addr_on_down=1 - -5. Additional VRF routes are added to associated table. - ip route add table 10 ... - - -Applications ------------- -Applications that are to work within a VRF need to bind their socket to the -VRF device: - - setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); - -or to specify the output device using cmsg and IP_PKTINFO. - -By default the scope of the port bindings for unbound sockets is -limited to the default VRF. That is, it will not be matched by packets -arriving on interfaces enslaved to an l3mdev and processes may bind to -the same port if they bind to an l3mdev. - -TCP & UDP services running in the default VRF context (ie., not bound -to any VRF device) can work across all VRF domains by enabling the -tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: - - sysctl -w net.ipv4.tcp_l3mdev_accept=1 - sysctl -w net.ipv4.udp_l3mdev_accept=1 - -These options are disabled by default so that a socket in a VRF is only -selected for packets in that VRF. There is a similar option for RAW -sockets, which is enabled by default for reasons of backwards compatibility. -This is so as to specify the output device with cmsg and IP_PKTINFO, but -using a socket not bound to the corresponding VRF. This allows e.g. older ping -implementations to be run with specifying the device but without executing it -in the VRF. This option can be disabled so that packets received in a VRF -context are only handled by a raw socket bound to the VRF, and packets in the -default VRF are only handled by a socket not bound to any VRF: - - sysctl -w net.ipv4.raw_l3mdev_accept=0 - -netfilter rules on the VRF device can be used to limit access to services -running in the default VRF context as well. - -################################################################################ - -Using iproute2 for VRFs -======================= -iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this -section lists both commands where appropriate -- with the vrf keyword and the -older form without it. - -1. Create a VRF - - To instantiate a VRF device and associate it with a table: - $ ip link add dev NAME type vrf table ID - - As of v4.8 the kernel supports the l3mdev FIB rule where a single rule - covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 on first - device create. - -2. List VRFs - - To list VRFs that have been created: - $ ip [-d] link show type vrf - NOTE: The -d option is needed to show the table id - - For example: - $ ip -d link show type vrf - 11: mgmt: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 - link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0 - vrf table 1 addrgenmode eui64 - 12: red: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 - link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0 - vrf table 10 addrgenmode eui64 - 13: blue: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 - link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0 - vrf table 66 addrgenmode eui64 - 14: green: mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 - link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0 - vrf table 81 addrgenmode eui64 - - - Or in brief output: - - $ ip -br link show type vrf - mgmt UP 72:b3:ba:91:e2:24 - red UP b6:6f:6e:f6:da:73 - blue UP 36:62:e8:7d:bb:8c - green UP e6:28:b8:63:70:bb - - -3. Assign a Network Interface to a VRF - - Network interfaces are assigned to a VRF by enslaving the netdevice to a - VRF device: - $ ip link set dev NAME master NAME - - On enslavement connected and local routes are automatically moved to the - table associated with the VRF device. - - For example: - $ ip link set dev eth0 master mgmt - - -4. Show Devices Assigned to a VRF - - To show devices that have been assigned to a specific VRF add the master - option to the ip command: - $ ip link show vrf NAME - $ ip link show master NAME - - For example: - $ ip link show vrf red - 3: eth1: mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 - link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff - 4: eth2: mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 - link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff - 7: eth5: mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000 - link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff - - - Or using the brief output: - $ ip -br link show vrf red - eth1 UP 02:00:00:00:02:02 - eth2 UP 02:00:00:00:02:03 - eth5 DOWN 02:00:00:00:02:06 - - -5. Show Neighbor Entries for a VRF - - To list neighbor entries associated with devices enslaved to a VRF device - add the master option to the ip command: - $ ip [-6] neigh show vrf NAME - $ ip [-6] neigh show master NAME - - For example: - $ ip neigh show vrf red - 10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE - 10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE - - $ ip -6 neigh show vrf red - 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE - - -6. Show Addresses for a VRF - - To show addresses for interfaces associated with a VRF add the master - option to the ip command: - $ ip addr show vrf NAME - $ ip addr show master NAME - - For example: - $ ip addr show vrf red - 3: eth1: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 - link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff - inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1 - valid_lft forever preferred_lft forever - inet6 2002:1::2/120 scope global - valid_lft forever preferred_lft forever - inet6 fe80::ff:fe00:202/64 scope link - valid_lft forever preferred_lft forever - 4: eth2: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 - link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff - inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2 - valid_lft forever preferred_lft forever - inet6 2002:2::2/120 scope global - valid_lft forever preferred_lft forever - inet6 fe80::ff:fe00:203/64 scope link - valid_lft forever preferred_lft forever - 7: eth5: mtu 1500 qdisc noop master red state DOWN group default qlen 1000 - link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff - - Or in brief format: - $ ip -br addr show vrf red - eth1 UP 10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64 - eth2 UP 10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64 - eth5 DOWN - - -7. Show Routes for a VRF - - To show routes for a VRF use the ip command to display the table associated - with the VRF device: - $ ip [-6] route show vrf NAME - $ ip [-6] route show table ID - - For example: - $ ip route show vrf red - unreachable default metric 4278198272 - broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2 - 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2 - local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2 - broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2 - broadcast 10.2.2.0 dev eth2 proto kernel scope link src 10.2.2.2 - 10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2 - local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2 - broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2 - - $ ip -6 route show vrf red - local 2002:1:: dev lo proto none metric 0 pref medium - local 2002:1::2 dev lo proto none metric 0 pref medium - 2002:1::/120 dev eth1 proto kernel metric 256 pref medium - local 2002:2:: dev lo proto none metric 0 pref medium - local 2002:2::2 dev lo proto none metric 0 pref medium - 2002:2::/120 dev eth2 proto kernel metric 256 pref medium - local fe80:: dev lo proto none metric 0 pref medium - local fe80:: dev lo proto none metric 0 pref medium - local fe80::ff:fe00:202 dev lo proto none metric 0 pref medium - local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium - fe80::/64 dev eth1 proto kernel metric 256 pref medium - fe80::/64 dev eth2 proto kernel metric 256 pref medium - ff00::/8 dev red metric 256 pref medium - ff00::/8 dev eth1 metric 256 pref medium - ff00::/8 dev eth2 metric 256 pref medium - unreachable default dev lo metric 4278198272 error -101 pref medium - -8. Route Lookup for a VRF - - A test route lookup can be done for a VRF: - $ ip [-6] route get vrf NAME ADDRESS - $ ip [-6] route get oif NAME ADDRESS - - For example: - $ ip route get 10.2.1.40 vrf red - 10.2.1.40 dev eth1 table red src 10.2.1.2 - cache - - $ ip -6 route get 2002:1::32 vrf red - 2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium - - -9. Removing Network Interface from a VRF - - Network interfaces are removed from a VRF by breaking the enslavement to - the VRF device: - $ ip link set dev NAME nomaster - - Connected routes are moved back to the default table and local entries are - moved to the local table. - - For example: - $ ip link set dev eth0 nomaster - --------------------------------------------------------------------------------- - -Commands used in this example: - -cat >> /etc/iproute2/rt_tables.d/vrf.conf < M: Shrijeet Mukherjee L: netdev@vger.kernel.org S: Maintained -F: Documentation/networking/vrf.txt +F: Documentation/networking/vrf.rst F: drivers/net/vrf.c VSPRINTF -- cgit v1.2.3 From d2a85c184ac6e738daa5e42f89b1f353910d6a89 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:26 +0200 Subject: docs: networking: convert vxlan.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/vxlan.rst | 60 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/vxlan.txt | 51 -------------------------------- 3 files changed, 61 insertions(+), 51 deletions(-) create mode 100644 Documentation/networking/vxlan.rst delete mode 100644 Documentation/networking/vxlan.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 2227b9f4509d..a72fdfb391b6 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -114,6 +114,7 @@ Contents: tuntap udplite vrf + vxlan .. only:: subproject and html diff --git a/Documentation/networking/vxlan.rst b/Documentation/networking/vxlan.rst new file mode 100644 index 000000000000..ce239fa01848 --- /dev/null +++ b/Documentation/networking/vxlan.rst @@ -0,0 +1,60 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================================== +Virtual eXtensible Local Area Networking documentation +====================================================== + +The VXLAN protocol is a tunnelling protocol designed to solve the +problem of limited VLAN IDs (4096) in IEEE 802.1q. With VXLAN the +size of the identifier is expanded to 24 bits (16777216). + +VXLAN is described by IETF RFC 7348, and has been implemented by a +number of vendors. The protocol runs over UDP using a single +destination port. This document describes the Linux kernel tunnel +device, there is also a separate implementation of VXLAN for +Openvswitch. + +Unlike most tunnels, a VXLAN is a 1 to N network, not just point to +point. A VXLAN device can learn the IP address of the other endpoint +either dynamically in a manner similar to a learning bridge, or make +use of statically-configured forwarding entries. + +The management of vxlan is done in a manner similar to its two closest +neighbors GRE and VLAN. Configuring VXLAN requires the version of +iproute2 that matches the kernel release where VXLAN was first merged +upstream. + +1. Create vxlan device:: + + # ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1 dstport 4789 + +This creates a new device named vxlan0. The device uses the multicast +group 239.1.1.1 over eth1 to handle traffic for which there is no +entry in the forwarding table. The destination port number is set to +the IANA-assigned value of 4789. The Linux implementation of VXLAN +pre-dates the IANA's selection of a standard destination port number +and uses the Linux-selected value by default to maintain backwards +compatibility. + +2. Delete vxlan device:: + + # ip link delete vxlan0 + +3. Show vxlan info:: + + # ip -d link show vxlan0 + +It is possible to create, destroy and display the vxlan +forwarding table using the new bridge command. + +1. Create forwarding table entry:: + + # bridge fdb add to 00:17:42:8a:b4:05 dst 192.19.0.2 dev vxlan0 + +2. Delete forwarding table entry:: + + # bridge fdb delete 00:17:42:8a:b4:05 dev vxlan0 + +3. Show forwarding table:: + + # bridge fdb show dev vxlan0 diff --git a/Documentation/networking/vxlan.txt b/Documentation/networking/vxlan.txt deleted file mode 100644 index c28f4989c3f0..000000000000 --- a/Documentation/networking/vxlan.txt +++ /dev/null @@ -1,51 +0,0 @@ -Virtual eXtensible Local Area Networking documentation -====================================================== - -The VXLAN protocol is a tunnelling protocol designed to solve the -problem of limited VLAN IDs (4096) in IEEE 802.1q. With VXLAN the -size of the identifier is expanded to 24 bits (16777216). - -VXLAN is described by IETF RFC 7348, and has been implemented by a -number of vendors. The protocol runs over UDP using a single -destination port. This document describes the Linux kernel tunnel -device, there is also a separate implementation of VXLAN for -Openvswitch. - -Unlike most tunnels, a VXLAN is a 1 to N network, not just point to -point. A VXLAN device can learn the IP address of the other endpoint -either dynamically in a manner similar to a learning bridge, or make -use of statically-configured forwarding entries. - -The management of vxlan is done in a manner similar to its two closest -neighbors GRE and VLAN. Configuring VXLAN requires the version of -iproute2 that matches the kernel release where VXLAN was first merged -upstream. - -1. Create vxlan device - # ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1 dstport 4789 - -This creates a new device named vxlan0. The device uses the multicast -group 239.1.1.1 over eth1 to handle traffic for which there is no -entry in the forwarding table. The destination port number is set to -the IANA-assigned value of 4789. The Linux implementation of VXLAN -pre-dates the IANA's selection of a standard destination port number -and uses the Linux-selected value by default to maintain backwards -compatibility. - -2. Delete vxlan device - # ip link delete vxlan0 - -3. Show vxlan info - # ip -d link show vxlan0 - -It is possible to create, destroy and display the vxlan -forwarding table using the new bridge command. - -1. Create forwarding table entry - # bridge fdb add to 00:17:42:8a:b4:05 dst 192.19.0.2 dev vxlan0 - -2. Delete forwarding table entry - # bridge fdb delete 00:17:42:8a:b4:05 dev vxlan0 - -3. Show forwarding table - # bridge fdb show dev vxlan0 -- cgit v1.2.3 From 883780af72090daf9ab53779a3085a6ddfc468ca Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:27 +0200 Subject: docs: networking: convert x25-iface.txt to ReST Not much to be done here: - add SPDX header; - adjust title markup; - remove a tail whitespace; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/x25-iface.rst | 129 +++++++++++++++++++++++++++++++++ Documentation/networking/x25-iface.txt | 123 ------------------------------- include/uapi/linux/if_x25.h | 2 +- net/x25/Kconfig | 2 +- 5 files changed, 132 insertions(+), 125 deletions(-) create mode 100644 Documentation/networking/x25-iface.rst delete mode 100644 Documentation/networking/x25-iface.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index a72fdfb391b6..7a4bdbc111b0 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -115,6 +115,7 @@ Contents: udplite vrf vxlan + x25-iface .. only:: subproject and html diff --git a/Documentation/networking/x25-iface.rst b/Documentation/networking/x25-iface.rst new file mode 100644 index 000000000000..df401891dce6 --- /dev/null +++ b/Documentation/networking/x25-iface.rst @@ -0,0 +1,129 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================- +X.25 Device Driver Interface +============================- + +Version 1.1 + + Jonathan Naylor 26.12.96 + +This is a description of the messages to be passed between the X.25 Packet +Layer and the X.25 device driver. They are designed to allow for the easy +setting of the LAPB mode from within the Packet Layer. + +The X.25 device driver will be coded normally as per the Linux device driver +standards. Most X.25 device drivers will be moderately similar to the +already existing Ethernet device drivers. However unlike those drivers, the +X.25 device driver has a state associated with it, and this information +needs to be passed to and from the Packet Layer for proper operation. + +All messages are held in sk_buff's just like real data to be transmitted +over the LAPB link. The first byte of the skbuff indicates the meaning of +the rest of the skbuff, if any more information does exist. + + +Packet Layer to Device Driver +----------------------------- + +First Byte = 0x00 (X25_IFACE_DATA) + +This indicates that the rest of the skbuff contains data to be transmitted +over the LAPB link. The LAPB link should already exist before any data is +passed down. + +First Byte = 0x01 (X25_IFACE_CONNECT) + +Establish the LAPB link. If the link is already established then the connect +confirmation message should be returned as soon as possible. + +First Byte = 0x02 (X25_IFACE_DISCONNECT) + +Terminate the LAPB link. If it is already disconnected then the disconnect +confirmation message should be returned as soon as possible. + +First Byte = 0x03 (X25_IFACE_PARAMS) + +LAPB parameters. To be defined. + + +Device Driver to Packet Layer +----------------------------- + +First Byte = 0x00 (X25_IFACE_DATA) + +This indicates that the rest of the skbuff contains data that has been +received over the LAPB link. + +First Byte = 0x01 (X25_IFACE_CONNECT) + +LAPB link has been established. The same message is used for both a LAPB +link connect_confirmation and a connect_indication. + +First Byte = 0x02 (X25_IFACE_DISCONNECT) + +LAPB link has been terminated. This same message is used for both a LAPB +link disconnect_confirmation and a disconnect_indication. + +First Byte = 0x03 (X25_IFACE_PARAMS) + +LAPB parameters. To be defined. + + + +Possible Problems +================= + +(Henner Eisen, 2000-10-28) + +The X.25 packet layer protocol depends on a reliable datalink service. +The LAPB protocol provides such reliable service. But this reliability +is not preserved by the Linux network device driver interface: + +- With Linux 2.4.x (and above) SMP kernels, packet ordering is not + preserved. Even if a device driver calls netif_rx(skb1) and later + netif_rx(skb2), skb2 might be delivered to the network layer + earlier that skb1. +- Data passed upstream by means of netif_rx() might be dropped by the + kernel if the backlog queue is congested. + +The X.25 packet layer protocol will detect this and reset the virtual +call in question. But many upper layer protocols are not designed to +handle such N-Reset events gracefully. And frequent N-Reset events +will always degrade performance. + +Thus, driver authors should make netif_rx() as reliable as possible: + +SMP re-ordering will not occur if the driver's interrupt handler is +always executed on the same CPU. Thus, + +- Driver authors should use irq affinity for the interrupt handler. + +The probability of packet loss due to backlog congestion can be +reduced by the following measures or a combination thereof: + +(1) Drivers for kernel versions 2.4.x and above should always check the + return value of netif_rx(). If it returns NET_RX_DROP, the + driver's LAPB protocol must not confirm reception of the frame + to the peer. + This will reliably suppress packet loss. The LAPB protocol will + automatically cause the peer to re-transmit the dropped packet + later. + The lapb module interface was modified to support this. Its + data_indication() method should now transparently pass the + netif_rx() return value to the (lapb module) caller. +(2) Drivers for kernel versions 2.2.x should always check the global + variable netdev_dropping when a new frame is received. The driver + should only call netif_rx() if netdev_dropping is zero. Otherwise + the driver should not confirm delivery of the frame and drop it. + Alternatively, the driver can queue the frame internally and call + netif_rx() later when netif_dropping is 0 again. In that case, delivery + confirmation should also be deferred such that the internal queue + cannot grow to much. + This will not reliably avoid packet loss, but the probability + of packet loss in netif_rx() path will be significantly reduced. +(3) Additionally, driver authors might consider to support + CONFIG_NET_HW_FLOWCONTROL. This allows the driver to be woken up + when a previously congested backlog queue becomes empty again. + The driver could uses this for flow-controlling the peer by means + of the LAPB protocol's flow-control service. diff --git a/Documentation/networking/x25-iface.txt b/Documentation/networking/x25-iface.txt deleted file mode 100644 index 7f213b556e85..000000000000 --- a/Documentation/networking/x25-iface.txt +++ /dev/null @@ -1,123 +0,0 @@ - X.25 Device Driver Interface 1.1 - - Jonathan Naylor 26.12.96 - -This is a description of the messages to be passed between the X.25 Packet -Layer and the X.25 device driver. They are designed to allow for the easy -setting of the LAPB mode from within the Packet Layer. - -The X.25 device driver will be coded normally as per the Linux device driver -standards. Most X.25 device drivers will be moderately similar to the -already existing Ethernet device drivers. However unlike those drivers, the -X.25 device driver has a state associated with it, and this information -needs to be passed to and from the Packet Layer for proper operation. - -All messages are held in sk_buff's just like real data to be transmitted -over the LAPB link. The first byte of the skbuff indicates the meaning of -the rest of the skbuff, if any more information does exist. - - -Packet Layer to Device Driver ------------------------------ - -First Byte = 0x00 (X25_IFACE_DATA) - -This indicates that the rest of the skbuff contains data to be transmitted -over the LAPB link. The LAPB link should already exist before any data is -passed down. - -First Byte = 0x01 (X25_IFACE_CONNECT) - -Establish the LAPB link. If the link is already established then the connect -confirmation message should be returned as soon as possible. - -First Byte = 0x02 (X25_IFACE_DISCONNECT) - -Terminate the LAPB link. If it is already disconnected then the disconnect -confirmation message should be returned as soon as possible. - -First Byte = 0x03 (X25_IFACE_PARAMS) - -LAPB parameters. To be defined. - - -Device Driver to Packet Layer ------------------------------ - -First Byte = 0x00 (X25_IFACE_DATA) - -This indicates that the rest of the skbuff contains data that has been -received over the LAPB link. - -First Byte = 0x01 (X25_IFACE_CONNECT) - -LAPB link has been established. The same message is used for both a LAPB -link connect_confirmation and a connect_indication. - -First Byte = 0x02 (X25_IFACE_DISCONNECT) - -LAPB link has been terminated. This same message is used for both a LAPB -link disconnect_confirmation and a disconnect_indication. - -First Byte = 0x03 (X25_IFACE_PARAMS) - -LAPB parameters. To be defined. - - - -Possible Problems -================= - -(Henner Eisen, 2000-10-28) - -The X.25 packet layer protocol depends on a reliable datalink service. -The LAPB protocol provides such reliable service. But this reliability -is not preserved by the Linux network device driver interface: - -- With Linux 2.4.x (and above) SMP kernels, packet ordering is not - preserved. Even if a device driver calls netif_rx(skb1) and later - netif_rx(skb2), skb2 might be delivered to the network layer - earlier that skb1. -- Data passed upstream by means of netif_rx() might be dropped by the - kernel if the backlog queue is congested. - -The X.25 packet layer protocol will detect this and reset the virtual -call in question. But many upper layer protocols are not designed to -handle such N-Reset events gracefully. And frequent N-Reset events -will always degrade performance. - -Thus, driver authors should make netif_rx() as reliable as possible: - -SMP re-ordering will not occur if the driver's interrupt handler is -always executed on the same CPU. Thus, - -- Driver authors should use irq affinity for the interrupt handler. - -The probability of packet loss due to backlog congestion can be -reduced by the following measures or a combination thereof: - -(1) Drivers for kernel versions 2.4.x and above should always check the - return value of netif_rx(). If it returns NET_RX_DROP, the - driver's LAPB protocol must not confirm reception of the frame - to the peer. - This will reliably suppress packet loss. The LAPB protocol will - automatically cause the peer to re-transmit the dropped packet - later. - The lapb module interface was modified to support this. Its - data_indication() method should now transparently pass the - netif_rx() return value to the (lapb module) caller. -(2) Drivers for kernel versions 2.2.x should always check the global - variable netdev_dropping when a new frame is received. The driver - should only call netif_rx() if netdev_dropping is zero. Otherwise - the driver should not confirm delivery of the frame and drop it. - Alternatively, the driver can queue the frame internally and call - netif_rx() later when netif_dropping is 0 again. In that case, delivery - confirmation should also be deferred such that the internal queue - cannot grow to much. - This will not reliably avoid packet loss, but the probability - of packet loss in netif_rx() path will be significantly reduced. -(3) Additionally, driver authors might consider to support - CONFIG_NET_HW_FLOWCONTROL. This allows the driver to be woken up - when a previously congested backlog queue becomes empty again. - The driver could uses this for flow-controlling the peer by means - of the LAPB protocol's flow-control service. diff --git a/include/uapi/linux/if_x25.h b/include/uapi/linux/if_x25.h index 5d962448345f..3a5938e38370 100644 --- a/include/uapi/linux/if_x25.h +++ b/include/uapi/linux/if_x25.h @@ -18,7 +18,7 @@ #include -/* Documentation/networking/x25-iface.txt */ +/* Documentation/networking/x25-iface.rst */ #define X25_IFACE_DATA 0x00 #define X25_IFACE_CONNECT 0x01 #define X25_IFACE_DISCONNECT 0x02 diff --git a/net/x25/Kconfig b/net/x25/Kconfig index 2ecb2e5e241e..a328f79885d1 100644 --- a/net/x25/Kconfig +++ b/net/x25/Kconfig @@ -21,7 +21,7 @@ config X25 . Information about X.25 for Linux is contained in the files and - . + . One connects to an X.25 network either with a dedicated network card using the X.21 protocol (not yet supported by Linux) or one can do -- cgit v1.2.3 From c4ea03fdfd122b4ff293bff643c2369852e9cc1c Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:28 +0200 Subject: docs: networking: convert x25.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/x25.rst | 48 ++++++++++++++++++++++++++++++++++++++ Documentation/networking/x25.txt | 44 ---------------------------------- net/x25/Kconfig | 2 +- 4 files changed, 50 insertions(+), 45 deletions(-) create mode 100644 Documentation/networking/x25.rst delete mode 100644 Documentation/networking/x25.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 7a4bdbc111b0..75521e6c473b 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -116,6 +116,7 @@ Contents: vrf vxlan x25-iface + x25 .. only:: subproject and html diff --git a/Documentation/networking/x25.rst b/Documentation/networking/x25.rst new file mode 100644 index 000000000000..00e45d384ba0 --- /dev/null +++ b/Documentation/networking/x25.rst @@ -0,0 +1,48 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +Linux X.25 Project +================== + +As my third year dissertation at University I have taken it upon myself to +write an X.25 implementation for Linux. My aim is to provide a complete X.25 +Packet Layer and a LAPB module to allow for "normal" X.25 to be run using +Linux. There are two sorts of X.25 cards available, intelligent ones that +implement LAPB on the card itself, and unintelligent ones that simply do +framing, bit-stuffing and checksumming. These both need to be handled by the +system. + +I therefore decided to write the implementation such that as far as the +Packet Layer is concerned, the link layer was being performed by a lower +layer of the Linux kernel and therefore it did not concern itself with +implementation of LAPB. Therefore the LAPB modules would be called by +unintelligent X.25 card drivers and not by intelligent ones, this would +provide a uniform device driver interface, and simplify configuration. + +To confuse matters a little, an 802.2 LLC implementation for Linux is being +written which will allow X.25 to be run over an Ethernet (or Token Ring) and +conform with the JNT "Pink Book", this will have a different interface to +the Packet Layer but there will be no confusion since the class of device +being served by the LLC will be completely separate from LAPB. The LLC +implementation is being done as part of another protocol project (SNA) and +by a different author. + +Just when you thought that it could not become more confusing, another +option appeared, XOT. This allows X.25 Packet Layer frames to operate over +the Internet using TCP/IP as a reliable link layer. RFC1613 specifies the +format and behaviour of the protocol. If time permits this option will also +be actively considered. + +A linux-x25 mailing list has been created at vger.kernel.org to support the +development and use of Linux X.25. It is early days yet, but interested +parties are welcome to subscribe to it. Just send a message to +majordomo@vger.kernel.org with the following in the message body: + +subscribe linux-x25 +end + +The contents of the Subject line are ignored. + +Jonathan + +g4klx@g4klx.demon.co.uk diff --git a/Documentation/networking/x25.txt b/Documentation/networking/x25.txt deleted file mode 100644 index c91c6d7159ff..000000000000 --- a/Documentation/networking/x25.txt +++ /dev/null @@ -1,44 +0,0 @@ -Linux X.25 Project - -As my third year dissertation at University I have taken it upon myself to -write an X.25 implementation for Linux. My aim is to provide a complete X.25 -Packet Layer and a LAPB module to allow for "normal" X.25 to be run using -Linux. There are two sorts of X.25 cards available, intelligent ones that -implement LAPB on the card itself, and unintelligent ones that simply do -framing, bit-stuffing and checksumming. These both need to be handled by the -system. - -I therefore decided to write the implementation such that as far as the -Packet Layer is concerned, the link layer was being performed by a lower -layer of the Linux kernel and therefore it did not concern itself with -implementation of LAPB. Therefore the LAPB modules would be called by -unintelligent X.25 card drivers and not by intelligent ones, this would -provide a uniform device driver interface, and simplify configuration. - -To confuse matters a little, an 802.2 LLC implementation for Linux is being -written which will allow X.25 to be run over an Ethernet (or Token Ring) and -conform with the JNT "Pink Book", this will have a different interface to -the Packet Layer but there will be no confusion since the class of device -being served by the LLC will be completely separate from LAPB. The LLC -implementation is being done as part of another protocol project (SNA) and -by a different author. - -Just when you thought that it could not become more confusing, another -option appeared, XOT. This allows X.25 Packet Layer frames to operate over -the Internet using TCP/IP as a reliable link layer. RFC1613 specifies the -format and behaviour of the protocol. If time permits this option will also -be actively considered. - -A linux-x25 mailing list has been created at vger.kernel.org to support the -development and use of Linux X.25. It is early days yet, but interested -parties are welcome to subscribe to it. Just send a message to -majordomo@vger.kernel.org with the following in the message body: - -subscribe linux-x25 -end - -The contents of the Subject line are ignored. - -Jonathan - -g4klx@g4klx.demon.co.uk diff --git a/net/x25/Kconfig b/net/x25/Kconfig index a328f79885d1..9f0d58b0b90b 100644 --- a/net/x25/Kconfig +++ b/net/x25/Kconfig @@ -20,7 +20,7 @@ config X25 You can read more about X.25 at and . Information about X.25 for Linux is contained in the files - and + and . One connects to an X.25 network either with a dedicated network card -- cgit v1.2.3 From c4a0eb9350183d1d188793c534e4141bcf2ccea8 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:29 +0200 Subject: docs: networking: convert xfrm_device.txt to ReST - add SPDX header; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/xfrm_device.rst | 151 +++++++++++++++++++++++++++++++ Documentation/networking/xfrm_device.txt | 140 ---------------------------- 3 files changed, 152 insertions(+), 140 deletions(-) create mode 100644 Documentation/networking/xfrm_device.rst delete mode 100644 Documentation/networking/xfrm_device.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 75521e6c473b..e31f6cb564b4 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -117,6 +117,7 @@ Contents: vxlan x25-iface x25 + xfrm_device .. only:: subproject and html diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst new file mode 100644 index 000000000000..da1073acda96 --- /dev/null +++ b/Documentation/networking/xfrm_device.rst @@ -0,0 +1,151 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================== +XFRM device - offloading the IPsec computations +=============================================== + +Shannon Nelson + + +Overview +======== + +IPsec is a useful feature for securing network traffic, but the +computational cost is high: a 10Gbps link can easily be brought down +to under 1Gbps, depending on the traffic and link configuration. +Luckily, there are NICs that offer a hardware based IPsec offload which +can radically increase throughput and decrease CPU utilization. The XFRM +Device interface allows NIC drivers to offer to the stack access to the +hardware offload. + +Userland access to the offload is typically through a system such as +libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can +be handy when experimenting. An example command might look something +like this:: + + ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \ + reqid 0x07 replay-window 32 \ + aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \ + sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \ + offload dev eth4 dir in + +Yes, that's ugly, but that's what shell scripts and/or libreswan are for. + + + +Callbacks to implement +====================== + +:: + + /* from include/linux/netdevice.h */ + struct xfrmdev_ops { + int (*xdo_dev_state_add) (struct xfrm_state *x); + void (*xdo_dev_state_delete) (struct xfrm_state *x); + void (*xdo_dev_state_free) (struct xfrm_state *x); + bool (*xdo_dev_offload_ok) (struct sk_buff *skb, + struct xfrm_state *x); + void (*xdo_dev_state_advance_esn) (struct xfrm_state *x); + }; + +The NIC driver offering ipsec offload will need to implement these +callbacks to make the offload available to the network stack's +XFRM subsytem. Additionally, the feature bits NETIF_F_HW_ESP and +NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload. + + + +Flow +==== + +At probe time and before the call to register_netdev(), the driver should +set up local data structures and XFRM callbacks, and set the feature bits. +The XFRM code's listener will finish the setup on NETDEV_REGISTER. + +:: + + adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops; + adapter->netdev->features |= NETIF_F_HW_ESP; + adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP; + +When new SAs are set up with a request for "offload" feature, the +driver's xdo_dev_state_add() will be given the new SA to be offloaded +and an indication of whether it is for Rx or Tx. The driver should + + - verify the algorithm is supported for offloads + - store the SA information (key, salt, target-ip, protocol, etc) + - enable the HW offload of the SA + - return status value: + + =========== =================================== + 0 success + -EOPNETSUPP offload not supported, try SW IPsec + other fail the request + =========== =================================== + +The driver can also set an offload_handle in the SA, an opaque void pointer +that can be used to convey context into the fast-path offload requests:: + + xs->xso.offload_handle = context; + + +When the network stack is preparing an IPsec packet for an SA that has +been setup for offload, it first calls into xdo_dev_offload_ok() with +the skb and the intended offload state to ask the driver if the offload +will serviceable. This can check the packet information to be sure the +offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and +return true of false to signify its support. + +When ready to send, the driver needs to inspect the Tx packet for the +offload information, including the opaque context, and set up the packet +send accordingly:: + + xs = xfrm_input_state(skb); + context = xs->xso.offload_handle; + set up HW for send + +The stack has already inserted the appropriate IPsec headers in the +packet data, the offload just needs to do the encryption and fix up the +header values. + + +When a packet is received and the HW has indicated that it offloaded a +decryption, the driver needs to add a reference to the decoded SA into +the packet's skb. At this point the data should be decrypted but the +IPsec headers are still in the packet data; they are removed later up +the stack in xfrm_input(). + + find and hold the SA that was used to the Rx skb:: + + get spi, protocol, and destination IP from packet headers + xs = find xs from (spi, protocol, dest_IP) + xfrm_state_hold(xs); + + store the state information into the skb:: + + sp = secpath_set(skb); + if (!sp) return; + sp->xvec[sp->len++] = xs; + sp->olen++; + + indicate the success and/or error status of the offload:: + + xo = xfrm_offload(skb); + xo->flags = CRYPTO_DONE; + xo->status = crypto_status; + + hand the packet to napi_gro_receive() as usual + +In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn(). +Driver will check packet seq number and update HW ESN state machine if needed. + +When the SA is removed by the user, the driver's xdo_dev_state_delete() +is asked to disable the offload. Later, xdo_dev_state_free() is called +from a garbage collection routine after all reference counts to the state +have been removed and any remaining resources can be cleared for the +offload state. How these are used by the driver will depend on specific +hardware needs. + +As a netdev is set to DOWN the XFRM stack's netdev listener will call +xdo_dev_state_delete() and xdo_dev_state_free() on any remaining offloaded +states. diff --git a/Documentation/networking/xfrm_device.txt b/Documentation/networking/xfrm_device.txt deleted file mode 100644 index a1c904dc70dc..000000000000 --- a/Documentation/networking/xfrm_device.txt +++ /dev/null @@ -1,140 +0,0 @@ - -=============================================== -XFRM device - offloading the IPsec computations -=============================================== -Shannon Nelson - - -Overview -======== - -IPsec is a useful feature for securing network traffic, but the -computational cost is high: a 10Gbps link can easily be brought down -to under 1Gbps, depending on the traffic and link configuration. -Luckily, there are NICs that offer a hardware based IPsec offload which -can radically increase throughput and decrease CPU utilization. The XFRM -Device interface allows NIC drivers to offer to the stack access to the -hardware offload. - -Userland access to the offload is typically through a system such as -libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can -be handy when experimenting. An example command might look something -like this: - - ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \ - reqid 0x07 replay-window 32 \ - aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \ - sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \ - offload dev eth4 dir in - -Yes, that's ugly, but that's what shell scripts and/or libreswan are for. - - - -Callbacks to implement -====================== - -/* from include/linux/netdevice.h */ -struct xfrmdev_ops { - int (*xdo_dev_state_add) (struct xfrm_state *x); - void (*xdo_dev_state_delete) (struct xfrm_state *x); - void (*xdo_dev_state_free) (struct xfrm_state *x); - bool (*xdo_dev_offload_ok) (struct sk_buff *skb, - struct xfrm_state *x); - void (*xdo_dev_state_advance_esn) (struct xfrm_state *x); -}; - -The NIC driver offering ipsec offload will need to implement these -callbacks to make the offload available to the network stack's -XFRM subsytem. Additionally, the feature bits NETIF_F_HW_ESP and -NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload. - - - -Flow -==== - -At probe time and before the call to register_netdev(), the driver should -set up local data structures and XFRM callbacks, and set the feature bits. -The XFRM code's listener will finish the setup on NETDEV_REGISTER. - - adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops; - adapter->netdev->features |= NETIF_F_HW_ESP; - adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP; - -When new SAs are set up with a request for "offload" feature, the -driver's xdo_dev_state_add() will be given the new SA to be offloaded -and an indication of whether it is for Rx or Tx. The driver should - - verify the algorithm is supported for offloads - - store the SA information (key, salt, target-ip, protocol, etc) - - enable the HW offload of the SA - - return status value: - 0 success - -EOPNETSUPP offload not supported, try SW IPsec - other fail the request - -The driver can also set an offload_handle in the SA, an opaque void pointer -that can be used to convey context into the fast-path offload requests. - - xs->xso.offload_handle = context; - - -When the network stack is preparing an IPsec packet for an SA that has -been setup for offload, it first calls into xdo_dev_offload_ok() with -the skb and the intended offload state to ask the driver if the offload -will serviceable. This can check the packet information to be sure the -offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and -return true of false to signify its support. - -When ready to send, the driver needs to inspect the Tx packet for the -offload information, including the opaque context, and set up the packet -send accordingly. - - xs = xfrm_input_state(skb); - context = xs->xso.offload_handle; - set up HW for send - -The stack has already inserted the appropriate IPsec headers in the -packet data, the offload just needs to do the encryption and fix up the -header values. - - -When a packet is received and the HW has indicated that it offloaded a -decryption, the driver needs to add a reference to the decoded SA into -the packet's skb. At this point the data should be decrypted but the -IPsec headers are still in the packet data; they are removed later up -the stack in xfrm_input(). - - find and hold the SA that was used to the Rx skb - get spi, protocol, and destination IP from packet headers - xs = find xs from (spi, protocol, dest_IP) - xfrm_state_hold(xs); - - store the state information into the skb - sp = secpath_set(skb); - if (!sp) return; - sp->xvec[sp->len++] = xs; - sp->olen++; - - indicate the success and/or error status of the offload - xo = xfrm_offload(skb); - xo->flags = CRYPTO_DONE; - xo->status = crypto_status; - - hand the packet to napi_gro_receive() as usual - -In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn(). -Driver will check packet seq number and update HW ESN state machine if needed. - -When the SA is removed by the user, the driver's xdo_dev_state_delete() -is asked to disable the offload. Later, xdo_dev_state_free() is called -from a garbage collection routine after all reference counts to the state -have been removed and any remaining resources can be cleared for the -offload state. How these are used by the driver will depend on specific -hardware needs. - -As a netdev is set to DOWN the XFRM stack's netdev listener will call -xdo_dev_state_delete() and xdo_dev_state_free() on any remaining offloaded -states. - - -- cgit v1.2.3 From da62baada5cc94037ef91ed0c414a930a3a06520 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:30 +0200 Subject: docs: networking: convert xfrm_proc.txt to ReST - add SPDX header; - adjust title markup; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/xfrm_proc.rst | 113 +++++++++++++++++++++++++++++++++ Documentation/networking/xfrm_proc.txt | 82 ------------------------ 3 files changed, 114 insertions(+), 82 deletions(-) create mode 100644 Documentation/networking/xfrm_proc.rst delete mode 100644 Documentation/networking/xfrm_proc.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e31f6cb564b4..3fe70efb632e 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -118,6 +118,7 @@ Contents: x25-iface x25 xfrm_device + xfrm_proc .. only:: subproject and html diff --git a/Documentation/networking/xfrm_proc.rst b/Documentation/networking/xfrm_proc.rst new file mode 100644 index 000000000000..0a771c5a7399 --- /dev/null +++ b/Documentation/networking/xfrm_proc.rst @@ -0,0 +1,113 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +XFRM proc - /proc/net/xfrm_* files +================================== + +Masahide NAKAMURA + + +Transformation Statistics +------------------------- + +The xfrm_proc code is a set of statistics showing numbers of packets +dropped by the transformation code and why. These counters are defined +as part of the linux private MIB. These counters can be viewed in +/proc/net/xfrm_stat. + + +Inbound errors +~~~~~~~~~~~~~~ + +XfrmInError: + All errors which is not matched others + +XfrmInBufferError: + No buffer is left + +XfrmInHdrError: + Header error + +XfrmInNoStates: + No state is found + i.e. Either inbound SPI, address, or IPsec protocol at SA is wrong + +XfrmInStateProtoError: + Transformation protocol specific error + e.g. SA key is wrong + +XfrmInStateModeError: + Transformation mode specific error + +XfrmInStateSeqError: + Sequence error + i.e. Sequence number is out of window + +XfrmInStateExpired: + State is expired + +XfrmInStateMismatch: + State has mismatch option + e.g. UDP encapsulation type is mismatch + +XfrmInStateInvalid: + State is invalid + +XfrmInTmplMismatch: + No matching template for states + e.g. Inbound SAs are correct but SP rule is wrong + +XfrmInNoPols: + No policy is found for states + e.g. Inbound SAs are correct but no SP is found + +XfrmInPolBlock: + Policy discards + +XfrmInPolError: + Policy error + +XfrmAcquireError: + State hasn't been fully acquired before use + +XfrmFwdHdrError: + Forward routing of a packet is not allowed + +Outbound errors +~~~~~~~~~~~~~~~ +XfrmOutError: + All errors which is not matched others + +XfrmOutBundleGenError: + Bundle generation error + +XfrmOutBundleCheckError: + Bundle check error + +XfrmOutNoStates: + No state is found + +XfrmOutStateProtoError: + Transformation protocol specific error + +XfrmOutStateModeError: + Transformation mode specific error + +XfrmOutStateSeqError: + Sequence error + i.e. Sequence number overflow + +XfrmOutStateExpired: + State is expired + +XfrmOutPolBlock: + Policy discards + +XfrmOutPolDead: + Policy is dead + +XfrmOutPolError: + Policy error + +XfrmOutStateInvalid: + State is invalid, perhaps expired diff --git a/Documentation/networking/xfrm_proc.txt b/Documentation/networking/xfrm_proc.txt deleted file mode 100644 index 2eae619ab67b..000000000000 --- a/Documentation/networking/xfrm_proc.txt +++ /dev/null @@ -1,82 +0,0 @@ -XFRM proc - /proc/net/xfrm_* files -================================== -Masahide NAKAMURA - - -Transformation Statistics -------------------------- - -The xfrm_proc code is a set of statistics showing numbers of packets -dropped by the transformation code and why. These counters are defined -as part of the linux private MIB. These counters can be viewed in -/proc/net/xfrm_stat. - - -Inbound errors -~~~~~~~~~~~~~~ -XfrmInError: - All errors which is not matched others -XfrmInBufferError: - No buffer is left -XfrmInHdrError: - Header error -XfrmInNoStates: - No state is found - i.e. Either inbound SPI, address, or IPsec protocol at SA is wrong -XfrmInStateProtoError: - Transformation protocol specific error - e.g. SA key is wrong -XfrmInStateModeError: - Transformation mode specific error -XfrmInStateSeqError: - Sequence error - i.e. Sequence number is out of window -XfrmInStateExpired: - State is expired -XfrmInStateMismatch: - State has mismatch option - e.g. UDP encapsulation type is mismatch -XfrmInStateInvalid: - State is invalid -XfrmInTmplMismatch: - No matching template for states - e.g. Inbound SAs are correct but SP rule is wrong -XfrmInNoPols: - No policy is found for states - e.g. Inbound SAs are correct but no SP is found -XfrmInPolBlock: - Policy discards -XfrmInPolError: - Policy error -XfrmAcquireError: - State hasn't been fully acquired before use -XfrmFwdHdrError: - Forward routing of a packet is not allowed - -Outbound errors -~~~~~~~~~~~~~~~ -XfrmOutError: - All errors which is not matched others -XfrmOutBundleGenError: - Bundle generation error -XfrmOutBundleCheckError: - Bundle check error -XfrmOutNoStates: - No state is found -XfrmOutStateProtoError: - Transformation protocol specific error -XfrmOutStateModeError: - Transformation mode specific error -XfrmOutStateSeqError: - Sequence error - i.e. Sequence number overflow -XfrmOutStateExpired: - State is expired -XfrmOutPolBlock: - Policy discards -XfrmOutPolDead: - Policy is dead -XfrmOutPolError: - Policy error -XfrmOutStateInvalid: - State is invalid, perhaps expired -- cgit v1.2.3 From a5cfea33e5e54854fa541deb08b85b782f21bab5 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:31 +0200 Subject: docs: networking: convert xfrm_sync.txt to ReST - add SPDX header; - add a document title; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/xfrm_sync.rst | 189 +++++++++++++++++++++++++++++++++ Documentation/networking/xfrm_sync.txt | 169 ----------------------------- 3 files changed, 190 insertions(+), 169 deletions(-) create mode 100644 Documentation/networking/xfrm_sync.rst delete mode 100644 Documentation/networking/xfrm_sync.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 3fe70efb632e..ec83bd95e4e9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -119,6 +119,7 @@ Contents: x25 xfrm_device xfrm_proc + xfrm_sync .. only:: subproject and html diff --git a/Documentation/networking/xfrm_sync.rst b/Documentation/networking/xfrm_sync.rst new file mode 100644 index 000000000000..6246503ceab2 --- /dev/null +++ b/Documentation/networking/xfrm_sync.rst @@ -0,0 +1,189 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==== +XFRM +==== + +The sync patches work is based on initial patches from +Krisztian and others and additional patches +from Jamal . + +The end goal for syncing is to be able to insert attributes + generate +events so that the SA can be safely moved from one machine to another +for HA purposes. +The idea is to synchronize the SA so that the takeover machine can do +the processing of the SA as accurate as possible if it has access to it. + +We already have the ability to generate SA add/del/upd events. +These patches add ability to sync and have accurate lifetime byte (to +ensure proper decay of SAs) and replay counters to avoid replay attacks +with as minimal loss at failover time. +This way a backup stays as closely up-to-date as an active member. + +Because the above items change for every packet the SA receives, +it is possible for a lot of the events to be generated. +For this reason, we also add a nagle-like algorithm to restrict +the events. i.e we are going to set thresholds to say "let me +know if the replay sequence threshold is reached or 10 secs have passed" +These thresholds are set system-wide via sysctls or can be updated +per SA. + +The identified items that need to be synchronized are: +- the lifetime byte counter +note that: lifetime time limit is not important if you assume the failover +machine is known ahead of time since the decay of the time countdown +is not driven by packet arrival. +- the replay sequence for both inbound and outbound + +1) Message Structure +---------------------- + +nlmsghdr:aevent_id:optional-TLVs. + +The netlink message types are: + +XFRM_MSG_NEWAE and XFRM_MSG_GETAE. + +A XFRM_MSG_GETAE does not have TLVs. + +A XFRM_MSG_NEWAE will have at least two TLVs (as is +discussed further below). + +aevent_id structure looks like:: + + struct xfrm_aevent_id { + struct xfrm_usersa_id sa_id; + xfrm_address_t saddr; + __u32 flags; + __u32 reqid; + }; + +The unique SA is identified by the combination of xfrm_usersa_id, +reqid and saddr. + +flags are used to indicate different things. The possible +flags are:: + + XFRM_AE_RTHR=1, /* replay threshold*/ + XFRM_AE_RVAL=2, /* replay value */ + XFRM_AE_LVAL=4, /* lifetime value */ + XFRM_AE_ETHR=8, /* expiry timer threshold */ + XFRM_AE_CR=16, /* Event cause is replay update */ + XFRM_AE_CE=32, /* Event cause is timer expiry */ + XFRM_AE_CU=64, /* Event cause is policy update */ + +How these flags are used is dependent on the direction of the +message (kernel<->user) as well the cause (config, query or event). +This is described below in the different messages. + +The pid will be set appropriately in netlink to recognize direction +(0 to the kernel and pid = processid that created the event +when going from kernel to user space) + +A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS +to get notified of these events. + +2) TLVS reflect the different parameters: +----------------------------------------- + +a) byte value (XFRMA_LTIME_VAL) + +This TLV carries the running/current counter for byte lifetime since +last event. + +b)replay value (XFRMA_REPLAY_VAL) + +This TLV carries the running/current counter for replay sequence since +last event. + +c)replay threshold (XFRMA_REPLAY_THRESH) + +This TLV carries the threshold being used by the kernel to trigger events +when the replay sequence is exceeded. + +d) expiry timer (XFRMA_ETIMER_THRESH) + +This is a timer value in milliseconds which is used as the nagle +value to rate limit the events. + +3) Default configurations for the parameters: +--------------------------------------------- + +By default these events should be turned off unless there is +at least one listener registered to listen to the multicast +group XFRMNLGRP_AEVENTS. + +Programs installing SAs will need to specify the two thresholds, however, +in order to not change existing applications such as racoon +we also provide default threshold values for these different parameters +in case they are not specified. + +the two sysctls/proc entries are: + +a) /proc/sys/net/core/sysctl_xfrm_aevent_etime +used to provide default values for the XFRMA_ETIMER_THRESH in incremental +units of time of 100ms. The default is 10 (1 second) + +b) /proc/sys/net/core/sysctl_xfrm_aevent_rseqth +used to provide default values for XFRMA_REPLAY_THRESH parameter +in incremental packet count. The default is two packets. + +4) Message types +---------------- + +a) XFRM_MSG_GETAE issued by user-->kernel. + XFRM_MSG_GETAE does not carry any TLVs. + +The response is a XFRM_MSG_NEWAE which is formatted based on what +XFRM_MSG_GETAE queried for. + +The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. +* if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved +* if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved + +b) XFRM_MSG_NEWAE is issued by either user space to configure + or kernel to announce events or respond to a XFRM_MSG_GETAE. + +i) user --> kernel to configure a specific SA. + +any of the values or threshold parameters can be updated by passing the +appropriate TLV. + +A response is issued back to the sender in user space to indicate success +or failure. + +In the case of success, additionally an event with +XFRM_MSG_NEWAE is also issued to any listeners as described in iii). + +ii) kernel->user direction as a response to XFRM_MSG_GETAE + +The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + +The threshold TLVs will be included if explicitly requested in +the XFRM_MSG_GETAE message. + +iii) kernel->user to report as event if someone sets any values or + thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above). + In such a case XFRM_AE_CU flag is set to inform the user that + the change happened as a result of an update. + The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + +iv) kernel->user to report event when replay threshold or a timeout + is exceeded. + +In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout +happened) is set to inform the user what happened. +Note the two flags are mutually exclusive. +The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + +Exceptions to threshold settings +-------------------------------- + +If you have an SA that is getting hit by traffic in bursts such that +there is a period where the timer threshold expires with no packets +seen, then an odd behavior is seen as follows: +The first packet arrival after a timer expiry will trigger a timeout +event; i.e we don't wait for a timeout period or a packet threshold +to be reached. This is done for simplicity and efficiency reasons. + +-JHS diff --git a/Documentation/networking/xfrm_sync.txt b/Documentation/networking/xfrm_sync.txt deleted file mode 100644 index 8d88e0f2ec49..000000000000 --- a/Documentation/networking/xfrm_sync.txt +++ /dev/null @@ -1,169 +0,0 @@ - -The sync patches work is based on initial patches from -Krisztian and others and additional patches -from Jamal . - -The end goal for syncing is to be able to insert attributes + generate -events so that the SA can be safely moved from one machine to another -for HA purposes. -The idea is to synchronize the SA so that the takeover machine can do -the processing of the SA as accurate as possible if it has access to it. - -We already have the ability to generate SA add/del/upd events. -These patches add ability to sync and have accurate lifetime byte (to -ensure proper decay of SAs) and replay counters to avoid replay attacks -with as minimal loss at failover time. -This way a backup stays as closely up-to-date as an active member. - -Because the above items change for every packet the SA receives, -it is possible for a lot of the events to be generated. -For this reason, we also add a nagle-like algorithm to restrict -the events. i.e we are going to set thresholds to say "let me -know if the replay sequence threshold is reached or 10 secs have passed" -These thresholds are set system-wide via sysctls or can be updated -per SA. - -The identified items that need to be synchronized are: -- the lifetime byte counter -note that: lifetime time limit is not important if you assume the failover -machine is known ahead of time since the decay of the time countdown -is not driven by packet arrival. -- the replay sequence for both inbound and outbound - -1) Message Structure ----------------------- - -nlmsghdr:aevent_id:optional-TLVs. - -The netlink message types are: - -XFRM_MSG_NEWAE and XFRM_MSG_GETAE. - -A XFRM_MSG_GETAE does not have TLVs. -A XFRM_MSG_NEWAE will have at least two TLVs (as is -discussed further below). - -aevent_id structure looks like: - - struct xfrm_aevent_id { - struct xfrm_usersa_id sa_id; - xfrm_address_t saddr; - __u32 flags; - __u32 reqid; - }; - -The unique SA is identified by the combination of xfrm_usersa_id, -reqid and saddr. - -flags are used to indicate different things. The possible -flags are: - XFRM_AE_RTHR=1, /* replay threshold*/ - XFRM_AE_RVAL=2, /* replay value */ - XFRM_AE_LVAL=4, /* lifetime value */ - XFRM_AE_ETHR=8, /* expiry timer threshold */ - XFRM_AE_CR=16, /* Event cause is replay update */ - XFRM_AE_CE=32, /* Event cause is timer expiry */ - XFRM_AE_CU=64, /* Event cause is policy update */ - -How these flags are used is dependent on the direction of the -message (kernel<->user) as well the cause (config, query or event). -This is described below in the different messages. - -The pid will be set appropriately in netlink to recognize direction -(0 to the kernel and pid = processid that created the event -when going from kernel to user space) - -A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS -to get notified of these events. - -2) TLVS reflect the different parameters: ------------------------------------------ - -a) byte value (XFRMA_LTIME_VAL) -This TLV carries the running/current counter for byte lifetime since -last event. - -b)replay value (XFRMA_REPLAY_VAL) -This TLV carries the running/current counter for replay sequence since -last event. - -c)replay threshold (XFRMA_REPLAY_THRESH) -This TLV carries the threshold being used by the kernel to trigger events -when the replay sequence is exceeded. - -d) expiry timer (XFRMA_ETIMER_THRESH) -This is a timer value in milliseconds which is used as the nagle -value to rate limit the events. - -3) Default configurations for the parameters: ----------------------------------------------- - -By default these events should be turned off unless there is -at least one listener registered to listen to the multicast -group XFRMNLGRP_AEVENTS. - -Programs installing SAs will need to specify the two thresholds, however, -in order to not change existing applications such as racoon -we also provide default threshold values for these different parameters -in case they are not specified. - -the two sysctls/proc entries are: -a) /proc/sys/net/core/sysctl_xfrm_aevent_etime -used to provide default values for the XFRMA_ETIMER_THRESH in incremental -units of time of 100ms. The default is 10 (1 second) - -b) /proc/sys/net/core/sysctl_xfrm_aevent_rseqth -used to provide default values for XFRMA_REPLAY_THRESH parameter -in incremental packet count. The default is two packets. - -4) Message types ----------------- - -a) XFRM_MSG_GETAE issued by user-->kernel. -XFRM_MSG_GETAE does not carry any TLVs. -The response is a XFRM_MSG_NEWAE which is formatted based on what -XFRM_MSG_GETAE queried for. -The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. -*if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved -*if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved - -b) XFRM_MSG_NEWAE is issued by either user space to configure -or kernel to announce events or respond to a XFRM_MSG_GETAE. - -i) user --> kernel to configure a specific SA. -any of the values or threshold parameters can be updated by passing the -appropriate TLV. -A response is issued back to the sender in user space to indicate success -or failure. -In the case of success, additionally an event with -XFRM_MSG_NEWAE is also issued to any listeners as described in iii). - -ii) kernel->user direction as a response to XFRM_MSG_GETAE -The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. -The threshold TLVs will be included if explicitly requested in -the XFRM_MSG_GETAE message. - -iii) kernel->user to report as event if someone sets any values or -thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above). -In such a case XFRM_AE_CU flag is set to inform the user that -the change happened as a result of an update. -The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. - -iv) kernel->user to report event when replay threshold or a timeout -is exceeded. -In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout -happened) is set to inform the user what happened. -Note the two flags are mutually exclusive. -The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. - -Exceptions to threshold settings --------------------------------- - -If you have an SA that is getting hit by traffic in bursts such that -there is a period where the timer threshold expires with no packets -seen, then an odd behavior is seen as follows: -The first packet arrival after a timer expiry will trigger a timeout -event; i.e we don't wait for a timeout period or a packet threshold -to be reached. This is done for simplicity and efficiency reasons. - --JHS -- cgit v1.2.3 From a6c34b476ca27d0e5c14e58aefdbbdc4c509dd5f Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:32 +0200 Subject: docs: networking: convert xfrm_sysctl.txt to ReST Not much to be done here: - add SPDX header; - add a document title; - add a chapter's markup; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/xfrm_sysctl.rst | 11 +++++++++++ Documentation/networking/xfrm_sysctl.txt | 4 ---- 3 files changed, 12 insertions(+), 4 deletions(-) create mode 100644 Documentation/networking/xfrm_sysctl.rst delete mode 100644 Documentation/networking/xfrm_sysctl.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index ec83bd95e4e9..1630801cec19 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -120,6 +120,7 @@ Contents: xfrm_device xfrm_proc xfrm_sync + xfrm_sysctl .. only:: subproject and html diff --git a/Documentation/networking/xfrm_sysctl.rst b/Documentation/networking/xfrm_sysctl.rst new file mode 100644 index 000000000000..47b9bbdd0179 --- /dev/null +++ b/Documentation/networking/xfrm_sysctl.rst @@ -0,0 +1,11 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +XFRM Syscall +============ + +/proc/sys/net/core/xfrm_* Variables: +==================================== + +xfrm_acq_expires - INTEGER + default 30 - hard timeout in seconds for acquire requests diff --git a/Documentation/networking/xfrm_sysctl.txt b/Documentation/networking/xfrm_sysctl.txt deleted file mode 100644 index 5bbd16792fe1..000000000000 --- a/Documentation/networking/xfrm_sysctl.txt +++ /dev/null @@ -1,4 +0,0 @@ -/proc/sys/net/core/xfrm_* Variables: - -xfrm_acq_expires - INTEGER - default 30 - hard timeout in seconds for acquire requests -- cgit v1.2.3 From 0046db09d539523ef1470bcad2f2614cc3ef7ddf Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:33 +0200 Subject: docs: networking: convert z8530drv.txt to ReST - add SPDX header; - use copyright symbol; - adjust titles and chapters, adding proper markups; - mark tables as such; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/z8530drv.rst | 686 ++++++++++++++++++++++++++++++++++ Documentation/networking/z8530drv.txt | 657 -------------------------------- MAINTAINERS | 2 +- drivers/net/hamradio/Kconfig | 4 +- drivers/net/hamradio/scc.c | 2 +- 6 files changed, 691 insertions(+), 661 deletions(-) create mode 100644 Documentation/networking/z8530drv.rst delete mode 100644 Documentation/networking/z8530drv.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 1630801cec19..f5733ca4fbcb 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -121,6 +121,7 @@ Contents: xfrm_proc xfrm_sync xfrm_sysctl + z8530drv .. only:: subproject and html diff --git a/Documentation/networking/z8530drv.rst b/Documentation/networking/z8530drv.rst new file mode 100644 index 000000000000..d2942760f167 --- /dev/null +++ b/Documentation/networking/z8530drv.rst @@ -0,0 +1,686 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +========================================================= +SCC.C - Linux driver for Z8530 based HDLC cards for AX.25 +========================================================= + + +This is a subset of the documentation. To use this driver you MUST have the +full package from: + +Internet: + + 1. ftp://ftp.ccac.rwth-aachen.de/pub/jr/z8530drv-utils_3.0-3.tar.gz + + 2. ftp://ftp.pspt.fi/pub/ham/linux/ax25/z8530drv-utils_3.0-3.tar.gz + +Please note that the information in this document may be hopelessly outdated. +A new version of the documentation, along with links to other important +Linux Kernel AX.25 documentation and programs, is available on +http://yaina.de/jreuter + +Copyright |copy| 1993,2000 by Joerg Reuter DL1BKE + +portions Copyright |copy| 1993 Guido ten Dolle PE1NNZ + +for the complete copyright notice see >> Copying.Z8530DRV << + +1. Initialization of the driver +=============================== + +To use the driver, 3 steps must be performed: + + 1. if compiled as module: loading the module + 2. Setup of hardware, MODEM and KISS parameters with sccinit + 3. Attach each channel to the Linux kernel AX.25 with "ifconfig" + +Unlike the versions below 2.4 this driver is a real network device +driver. If you want to run xNOS instead of our fine kernel AX.25 +use a 2.x version (available from above sites) or read the +AX.25-HOWTO on how to emulate a KISS TNC on network device drivers. + + +1.1 Loading the module +====================== + +(If you're going to compile the driver as a part of the kernel image, + skip this chapter and continue with 1.2) + +Before you can use a module, you'll have to load it with:: + + insmod scc.o + +please read 'man insmod' that comes with module-init-tools. + +You should include the insmod in one of the /etc/rc.d/rc.* files, +and don't forget to insert a call of sccinit after that. It +will read your /etc/z8530drv.conf. + +1.2. /etc/z8530drv.conf +======================= + +To setup all parameters you must run /sbin/sccinit from one +of your rc.*-files. This has to be done BEFORE you can +"ifconfig" an interface. Sccinit reads the file /etc/z8530drv.conf +and sets the hardware, MODEM and KISS parameters. A sample file is +delivered with this package. Change it to your needs. + +The file itself consists of two main sections. + +1.2.1 configuration of hardware parameters +========================================== + +The hardware setup section defines the following parameters for each +Z8530:: + + chip 1 + data_a 0x300 # data port A + ctrl_a 0x304 # control port A + data_b 0x301 # data port B + ctrl_b 0x305 # control port B + irq 5 # IRQ No. 5 + pclock 4915200 # clock + board BAYCOM # hardware type + escc no # enhanced SCC chip? (8580/85180/85280) + vector 0 # latch for interrupt vector + special no # address of special function register + option 0 # option to set via sfr + + +chip + - this is just a delimiter to make sccinit a bit simpler to + program. A parameter has no effect. + +data_a + - the address of the data port A of this Z8530 (needed) +ctrl_a + - the address of the control port A (needed) +data_b + - the address of the data port B (needed) +ctrl_b + - the address of the control port B (needed) + +irq + - the used IRQ for this chip. Different chips can use different + IRQs or the same. If they share an interrupt, it needs to be + specified within one chip-definition only. + +pclock - the clock at the PCLK pin of the Z8530 (option, 4915200 is + default), measured in Hertz + +board + - the "type" of the board: + + ======================= ======== + SCC type value + ======================= ======== + PA0HZP SCC card PA0HZP + EAGLE card EAGLE + PC100 card PC100 + PRIMUS-PC (DG9BL) card PRIMUS + BayCom (U)SCC card BAYCOM + ======================= ======== + +escc + - if you want support for ESCC chips (8580, 85180, 85280), set + this to "yes" (option, defaults to "no") + +vector + - address of the vector latch (aka "intack port") for PA0HZP + cards. There can be only one vector latch for all chips! + (option, defaults to 0) + +special + - address of the special function register on several cards. + (option, defaults to 0) + +option - The value you write into that register (option, default is 0) + +You can specify up to four chips (8 channels). If this is not enough, +just change:: + + #define MAXSCC 4 + +to a higher value. + +Example for the BAYCOM USCC: +---------------------------- + +:: + + chip 1 + data_a 0x300 # data port A + ctrl_a 0x304 # control port A + data_b 0x301 # data port B + ctrl_b 0x305 # control port B + irq 5 # IRQ No. 5 (#) + board BAYCOM # hardware type (*) + # + # SCC chip 2 + # + chip 2 + data_a 0x302 + ctrl_a 0x306 + data_b 0x303 + ctrl_b 0x307 + board BAYCOM + +An example for a PA0HZP card: +----------------------------- + +:: + + chip 1 + data_a 0x153 + data_b 0x151 + ctrl_a 0x152 + ctrl_b 0x150 + irq 9 + pclock 4915200 + board PA0HZP + vector 0x168 + escc no + # + # + # + chip 2 + data_a 0x157 + data_b 0x155 + ctrl_a 0x156 + ctrl_b 0x154 + irq 9 + pclock 4915200 + board PA0HZP + vector 0x168 + escc no + +A DRSI would should probably work with this: +-------------------------------------------- +(actually: two DRSI cards...) + +:: + + chip 1 + data_a 0x303 + data_b 0x301 + ctrl_a 0x302 + ctrl_b 0x300 + irq 7 + pclock 4915200 + board DRSI + escc no + # + # + # + chip 2 + data_a 0x313 + data_b 0x311 + ctrl_a 0x312 + ctrl_b 0x310 + irq 7 + pclock 4915200 + board DRSI + escc no + +Note that you cannot use the on-board baudrate generator off DRSI +cards. Use "mode dpll" for clock source (see below). + +This is based on information provided by Mike Bilow (and verified +by Paul Helay) + +The utility "gencfg" +-------------------- + +If you only know the parameters for the PE1CHL driver for DOS, +run gencfg. It will generate the correct port addresses (I hope). +Its parameters are exactly the same as the ones you use with +the "attach scc" command in net, except that the string "init" must +not appear. Example:: + + gencfg 2 0x150 4 2 0 1 0x168 9 4915200 + +will print a skeleton z8530drv.conf for the OptoSCC to stdout. + +:: + + gencfg 2 0x300 2 4 5 -4 0 7 4915200 0x10 + +does the same for the BAYCOM USCC card. In my opinion it is much easier +to edit scc_config.h... + + +1.2.2 channel configuration +=========================== + +The channel definition is divided into three sub sections for each +channel: + +An example for scc0:: + + # DEVICE + + device scc0 # the device for the following params + + # MODEM / BUFFERS + + speed 1200 # the default baudrate + clock dpll # clock source: + # dpll = normal half duplex operation + # external = MODEM provides own Rx/Tx clock + # divider = use full duplex divider if + # installed (1) + mode nrzi # HDLC encoding mode + # nrzi = 1k2 MODEM, G3RUH 9k6 MODEM + # nrz = DF9IC 9k6 MODEM + # + bufsize 384 # size of buffers. Note that this must include + # the AX.25 header, not only the data field! + # (optional, defaults to 384) + + # KISS (Layer 1) + + txdelay 36 # (see chapter 1.4) + persist 64 + slot 8 + tail 8 + fulldup 0 + wait 12 + min 3 + maxkey 7 + idle 3 + maxdef 120 + group 0 + txoff off + softdcd on + slip off + +The order WITHIN these sections is unimportant. The order OF these +sections IS important. The MODEM parameters are set with the first +recognized KISS parameter... + +Please note that you can initialize the board only once after boot +(or insmod). You can change all parameters but "mode" and "clock" +later with the Sccparam program or through KISS. Just to avoid +security holes... + +(1) this divider is usually mounted on the SCC-PBC (PA0HZP) or not + present at all (BayCom). It feeds back the output of the DPLL + (digital pll) as transmit clock. Using this mode without a divider + installed will normally result in keying the transceiver until + maxkey expires --- of course without sending anything (useful). + +2. Attachment of a channel by your AX.25 software +================================================= + +2.1 Kernel AX.25 +================ + +To set up an AX.25 device you can simply type:: + + ifconfig scc0 44.128.1.1 hw ax25 dl0tha-7 + +This will create a network interface with the IP number 44.128.20.107 +and the callsign "dl0tha". If you do not have any IP number (yet) you +can use any of the 44.128.0.0 network. Note that you do not need +axattach. The purpose of axattach (like slattach) is to create a KISS +network device linked to a TTY. Please read the documentation of the +ax25-utils and the AX.25-HOWTO to learn how to set the parameters of +the kernel AX.25. + +2.2 NOS, NET and TFKISS +======================= + +Since the TTY driver (aka KISS TNC emulation) is gone you need +to emulate the old behaviour. The cost of using these programs is +that you probably need to compile the kernel AX.25, regardless of whether +you actually use it or not. First setup your /etc/ax25/axports, +for example:: + + 9k6 dl0tha-9 9600 255 4 9600 baud port (scc3) + axlink dl0tha-15 38400 255 4 Link to NOS + +Now "ifconfig" the scc device:: + + ifconfig scc3 44.128.1.1 hw ax25 dl0tha-9 + +You can now axattach a pseudo-TTY:: + + axattach /dev/ptys0 axlink + +and start your NOS and attach /dev/ptys0 there. The problem is that +NOS is reachable only via digipeating through the kernel AX.25 +(disastrous on a DAMA controlled channel). To solve this problem, +configure "rxecho" to echo the incoming frames from "9k6" to "axlink" +and outgoing frames from "axlink" to "9k6" and start:: + + rxecho + +Or simply use "kissbridge" coming with z8530drv-utils:: + + ifconfig scc3 hw ax25 dl0tha-9 + kissbridge scc3 /dev/ptys0 + + +3. Adjustment and Display of parameters +======================================= + +3.1 Displaying SCC Parameters: +============================== + +Once a SCC channel has been attached, the parameter settings and +some statistic information can be shown using the param program:: + + dl1bke-u:~$ sccstat scc0 + + Parameters: + + speed : 1200 baud + txdelay : 36 + persist : 255 + slottime : 0 + txtail : 8 + fulldup : 1 + waittime : 12 + mintime : 3 sec + maxkeyup : 7 sec + idletime : 3 sec + maxdefer : 120 sec + group : 0x00 + txoff : off + softdcd : on + SLIP : off + + Status: + + HDLC Z8530 Interrupts Buffers + ----------------------------------------------------------------------- + Sent : 273 RxOver : 0 RxInts : 125074 Size : 384 + Received : 1095 TxUnder: 0 TxInts : 4684 NoSpace : 0 + RxErrors : 1591 ExInts : 11776 + TxErrors : 0 SpInts : 1503 + Tx State : idle + + +The status info shown is: + +============== ============================================================== +Sent number of frames transmitted +Received number of frames received +RxErrors number of receive errors (CRC, ABORT) +TxErrors number of discarded Tx frames (due to various reasons) +Tx State status of the Tx interrupt handler: idle/busy/active/tail (2) +RxOver number of receiver overruns +TxUnder number of transmitter underruns +RxInts number of receiver interrupts +TxInts number of transmitter interrupts +EpInts number of receiver special condition interrupts +SpInts number of external/status interrupts +Size maximum size of an AX.25 frame (*with* AX.25 headers!) +NoSpace number of times a buffer could not get allocated +============== ============================================================== + +An overrun is abnormal. If lots of these occur, the product of +baudrate and number of interfaces is too high for the processing +power of your computer. NoSpace errors are unlikely to be caused by the +driver or the kernel AX.25. + + +3.2 Setting Parameters +====================== + + +The setting of parameters of the emulated KISS TNC is done in the +same way in the SCC driver. You can change parameters by using +the kissparms program from the ax25-utils package or use the program +"sccparam":: + + sccparam + +You can change the following parameters: + +=========== ===== +param value +=========== ===== +speed 1200 +txdelay 36 +persist 255 +slottime 0 +txtail 8 +fulldup 1 +waittime 12 +mintime 3 +maxkeyup 7 +idletime 3 +maxdefer 120 +group 0x00 +txoff off +softdcd on +SLIP off +=========== ===== + + +The parameters have the following meaning: + +speed: + The baudrate on this channel in bits/sec + + Example: sccparam /dev/scc3 speed 9600 + +txdelay: + The delay (in units of 10 ms) after keying of the + transmitter, until the first byte is sent. This is usually + called "TXDELAY" in a TNC. When 0 is specified, the driver + will just wait until the CTS signal is asserted. This + assumes the presence of a timer or other circuitry in the + MODEM and/or transmitter, that asserts CTS when the + transmitter is ready for data. + A normal value of this parameter is 30-36. + + Example: sccparam /dev/scc0 txd 20 + +persist: + This is the probability that the transmitter will be keyed + when the channel is found to be free. It is a value from 0 + to 255, and the probability is (value+1)/256. The value + should be somewhere near 50-60, and should be lowered when + the channel is used more heavily. + + Example: sccparam /dev/scc2 persist 20 + +slottime: + This is the time between samples of the channel. It is + expressed in units of 10 ms. About 200-300 ms (value 20-30) + seems to be a good value. + + Example: sccparam /dev/scc0 slot 20 + +tail: + The time the transmitter will remain keyed after the last + byte of a packet has been transferred to the SCC. This is + necessary because the CRC and a flag still have to leave the + SCC before the transmitter is keyed down. The value depends + on the baudrate selected. A few character times should be + sufficient, e.g. 40ms at 1200 baud. (value 4) + The value of this parameter is in 10 ms units. + + Example: sccparam /dev/scc2 4 + +full: + The full-duplex mode switch. This can be one of the following + values: + + 0: The interface will operate in CSMA mode (the normal + half-duplex packet radio operation) + 1: Fullduplex mode, i.e. the transmitter will be keyed at + any time, without checking the received carrier. It + will be unkeyed when there are no packets to be sent. + 2: Like 1, but the transmitter will remain keyed, also + when there are no packets to be sent. Flags will be + sent in that case, until a timeout (parameter 10) + occurs. + + Example: sccparam /dev/scc0 fulldup off + +wait: + The initial waittime before any transmit attempt, after the + frame has been queue for transmit. This is the length of + the first slot in CSMA mode. In full duplex modes it is + set to 0 for maximum performance. + The value of this parameter is in 10 ms units. + + Example: sccparam /dev/scc1 wait 4 + +maxkey: + The maximal time the transmitter will be keyed to send + packets, in seconds. This can be useful on busy CSMA + channels, to avoid "getting a bad reputation" when you are + generating a lot of traffic. After the specified time has + elapsed, no new frame will be started. Instead, the trans- + mitter will be switched off for a specified time (parameter + min), and then the selected algorithm for keyup will be + started again. + The value 0 as well as "off" will disable this feature, + and allow infinite transmission time. + + Example: sccparam /dev/scc0 maxk 20 + +min: + This is the time the transmitter will be switched off when + the maximum transmission time is exceeded. + + Example: sccparam /dev/scc3 min 10 + +idle: + This parameter specifies the maximum idle time in full duplex + 2 mode, in seconds. When no frames have been sent for this + time, the transmitter will be keyed down. A value of 0 is + has same result as the fullduplex mode 1. This parameter + can be disabled. + + Example: sccparam /dev/scc2 idle off # transmit forever + +maxdefer + This is the maximum time (in seconds) to wait for a free channel + to send. When this timer expires the transmitter will be keyed + IMMEDIATELY. If you love to get trouble with other users you + should set this to a very low value ;-) + + Example: sccparam /dev/scc0 maxdefer 240 # 2 minutes + + +txoff: + When this parameter has the value 0, the transmission of packets + is enable. Otherwise it is disabled. + + Example: sccparam /dev/scc2 txoff on + +group: + It is possible to build special radio equipment to use more than + one frequency on the same band, e.g. using several receivers and + only one transmitter that can be switched between frequencies. + Also, you can connect several radios that are active on the same + band. In these cases, it is not possible, or not a good idea, to + transmit on more than one frequency. The SCC driver provides a + method to lock transmitters on different interfaces, using the + "param group " command. This will only work when + you are using CSMA mode (parameter full = 0). + + The number must be 0 if you want no group restrictions, and + can be computed as follows to create restricted groups: + is the sum of some OCTAL numbers: + + + === ======================================================= + 200 This transmitter will only be keyed when all other + transmitters in the group are off. + 100 This transmitter will only be keyed when the carrier + detect of all other interfaces in the group is off. + 0xx A byte that can be used to define different groups. + Interfaces are in the same group, when the logical AND + between their xx values is nonzero. + === ======================================================= + + Examples: + + When 2 interfaces use group 201, their transmitters will never be + keyed at the same time. + + When 2 interfaces use group 101, the transmitters will only key + when both channels are clear at the same time. When group 301, + the transmitters will not be keyed at the same time. + + Don't forget to convert the octal numbers into decimal before + you set the parameter. + + Example: (to be written) + +softdcd: + use a software dcd instead of the real one... Useful for a very + slow squelch. + + Example: sccparam /dev/scc0 soft on + + +4. Problems +=========== + +If you have tx-problems with your BayCom USCC card please check +the manufacturer of the 8530. SGS chips have a slightly +different timing. Try Zilog... A solution is to write to register 8 +instead to the data port, but this won't work with the ESCC chips. +*SIGH!* + +A very common problem is that the PTT locks until the maxkeyup timer +expires, although interrupts and clock source are correct. In most +cases compiling the driver with CONFIG_SCC_DELAY (set with +make config) solves the problems. For more hints read the (pseudo) FAQ +and the documentation coming with z8530drv-utils. + +I got reports that the driver has problems on some 386-based systems. +(i.e. Amstrad) Those systems have a bogus AT bus timing which will +lead to delayed answers on interrupts. You can recognize these +problems by looking at the output of Sccstat for the suspected +port. If it shows under- and overruns you own such a system. + +Delayed processing of received data: This depends on + +- the kernel version + +- kernel profiling compiled or not + +- a high interrupt load + +- a high load of the machine --- running X, Xmorph, XV and Povray, + while compiling the kernel... hmm ... even with 32 MB RAM ... ;-) + Or running a named for the whole .ampr.org domain on an 8 MB + box... + +- using information from rxecho or kissbridge. + +Kernel panics: please read /linux/README and find out if it +really occurred within the scc driver. + +If you cannot solve a problem, send me + +- a description of the problem, +- information on your hardware (computer system, scc board, modem) +- your kernel version +- the output of cat /proc/net/z8530 + +4. Thor RLC100 +============== + +Mysteriously this board seems not to work with the driver. Anyone +got it up-and-running? + + +Many thanks to Linus Torvalds and Alan Cox for including the driver +in the Linux standard distribution and their support. + +:: + + Joerg Reuter ampr-net: dl1bke@db0pra.ampr.org + AX-25 : DL1BKE @ DB0ABH.#BAY.DEU.EU + Internet: jreuter@yaina.de + WWW : http://yaina.de/jreuter diff --git a/Documentation/networking/z8530drv.txt b/Documentation/networking/z8530drv.txt deleted file mode 100644 index 2206abbc3e1b..000000000000 --- a/Documentation/networking/z8530drv.txt +++ /dev/null @@ -1,657 +0,0 @@ -This is a subset of the documentation. To use this driver you MUST have the -full package from: - -Internet: -========= - -1. ftp://ftp.ccac.rwth-aachen.de/pub/jr/z8530drv-utils_3.0-3.tar.gz - -2. ftp://ftp.pspt.fi/pub/ham/linux/ax25/z8530drv-utils_3.0-3.tar.gz - -Please note that the information in this document may be hopelessly outdated. -A new version of the documentation, along with links to other important -Linux Kernel AX.25 documentation and programs, is available on -http://yaina.de/jreuter - ------------------------------------------------------------------------------ - - - SCC.C - Linux driver for Z8530 based HDLC cards for AX.25 - - ******************************************************************** - - (c) 1993,2000 by Joerg Reuter DL1BKE - - portions (c) 1993 Guido ten Dolle PE1NNZ - - for the complete copyright notice see >> Copying.Z8530DRV << - - ******************************************************************** - - -1. Initialization of the driver -=============================== - -To use the driver, 3 steps must be performed: - - 1. if compiled as module: loading the module - 2. Setup of hardware, MODEM and KISS parameters with sccinit - 3. Attach each channel to the Linux kernel AX.25 with "ifconfig" - -Unlike the versions below 2.4 this driver is a real network device -driver. If you want to run xNOS instead of our fine kernel AX.25 -use a 2.x version (available from above sites) or read the -AX.25-HOWTO on how to emulate a KISS TNC on network device drivers. - - -1.1 Loading the module -====================== - -(If you're going to compile the driver as a part of the kernel image, - skip this chapter and continue with 1.2) - -Before you can use a module, you'll have to load it with - - insmod scc.o - -please read 'man insmod' that comes with module-init-tools. - -You should include the insmod in one of the /etc/rc.d/rc.* files, -and don't forget to insert a call of sccinit after that. It -will read your /etc/z8530drv.conf. - -1.2. /etc/z8530drv.conf -======================= - -To setup all parameters you must run /sbin/sccinit from one -of your rc.*-files. This has to be done BEFORE you can -"ifconfig" an interface. Sccinit reads the file /etc/z8530drv.conf -and sets the hardware, MODEM and KISS parameters. A sample file is -delivered with this package. Change it to your needs. - -The file itself consists of two main sections. - -1.2.1 configuration of hardware parameters -========================================== - -The hardware setup section defines the following parameters for each -Z8530: - -chip 1 -data_a 0x300 # data port A -ctrl_a 0x304 # control port A -data_b 0x301 # data port B -ctrl_b 0x305 # control port B -irq 5 # IRQ No. 5 -pclock 4915200 # clock -board BAYCOM # hardware type -escc no # enhanced SCC chip? (8580/85180/85280) -vector 0 # latch for interrupt vector -special no # address of special function register -option 0 # option to set via sfr - - -chip - this is just a delimiter to make sccinit a bit simpler to - program. A parameter has no effect. - -data_a - the address of the data port A of this Z8530 (needed) -ctrl_a - the address of the control port A (needed) -data_b - the address of the data port B (needed) -ctrl_b - the address of the control port B (needed) - -irq - the used IRQ for this chip. Different chips can use different - IRQs or the same. If they share an interrupt, it needs to be - specified within one chip-definition only. - -pclock - the clock at the PCLK pin of the Z8530 (option, 4915200 is - default), measured in Hertz - -board - the "type" of the board: - - SCC type value - --------------------------------- - PA0HZP SCC card PA0HZP - EAGLE card EAGLE - PC100 card PC100 - PRIMUS-PC (DG9BL) card PRIMUS - BayCom (U)SCC card BAYCOM - -escc - if you want support for ESCC chips (8580, 85180, 85280), set - this to "yes" (option, defaults to "no") - -vector - address of the vector latch (aka "intack port") for PA0HZP - cards. There can be only one vector latch for all chips! - (option, defaults to 0) - -special - address of the special function register on several cards. - (option, defaults to 0) - -option - The value you write into that register (option, default is 0) - -You can specify up to four chips (8 channels). If this is not enough, -just change - - #define MAXSCC 4 - -to a higher value. - -Example for the BAYCOM USCC: ----------------------------- - -chip 1 -data_a 0x300 # data port A -ctrl_a 0x304 # control port A -data_b 0x301 # data port B -ctrl_b 0x305 # control port B -irq 5 # IRQ No. 5 (#) -board BAYCOM # hardware type (*) -# -# SCC chip 2 -# -chip 2 -data_a 0x302 -ctrl_a 0x306 -data_b 0x303 -ctrl_b 0x307 -board BAYCOM - -An example for a PA0HZP card: ------------------------------ - -chip 1 -data_a 0x153 -data_b 0x151 -ctrl_a 0x152 -ctrl_b 0x150 -irq 9 -pclock 4915200 -board PA0HZP -vector 0x168 -escc no -# -# -# -chip 2 -data_a 0x157 -data_b 0x155 -ctrl_a 0x156 -ctrl_b 0x154 -irq 9 -pclock 4915200 -board PA0HZP -vector 0x168 -escc no - -A DRSI would should probably work with this: --------------------------------------------- -(actually: two DRSI cards...) - -chip 1 -data_a 0x303 -data_b 0x301 -ctrl_a 0x302 -ctrl_b 0x300 -irq 7 -pclock 4915200 -board DRSI -escc no -# -# -# -chip 2 -data_a 0x313 -data_b 0x311 -ctrl_a 0x312 -ctrl_b 0x310 -irq 7 -pclock 4915200 -board DRSI -escc no - -Note that you cannot use the on-board baudrate generator off DRSI -cards. Use "mode dpll" for clock source (see below). - -This is based on information provided by Mike Bilow (and verified -by Paul Helay) - -The utility "gencfg" --------------------- - -If you only know the parameters for the PE1CHL driver for DOS, -run gencfg. It will generate the correct port addresses (I hope). -Its parameters are exactly the same as the ones you use with -the "attach scc" command in net, except that the string "init" must -not appear. Example: - -gencfg 2 0x150 4 2 0 1 0x168 9 4915200 - -will print a skeleton z8530drv.conf for the OptoSCC to stdout. - -gencfg 2 0x300 2 4 5 -4 0 7 4915200 0x10 - -does the same for the BAYCOM USCC card. In my opinion it is much easier -to edit scc_config.h... - - -1.2.2 channel configuration -=========================== - -The channel definition is divided into three sub sections for each -channel: - -An example for scc0: - -# DEVICE - -device scc0 # the device for the following params - -# MODEM / BUFFERS - -speed 1200 # the default baudrate -clock dpll # clock source: - # dpll = normal half duplex operation - # external = MODEM provides own Rx/Tx clock - # divider = use full duplex divider if - # installed (1) -mode nrzi # HDLC encoding mode - # nrzi = 1k2 MODEM, G3RUH 9k6 MODEM - # nrz = DF9IC 9k6 MODEM - # -bufsize 384 # size of buffers. Note that this must include - # the AX.25 header, not only the data field! - # (optional, defaults to 384) - -# KISS (Layer 1) - -txdelay 36 # (see chapter 1.4) -persist 64 -slot 8 -tail 8 -fulldup 0 -wait 12 -min 3 -maxkey 7 -idle 3 -maxdef 120 -group 0 -txoff off -softdcd on -slip off - -The order WITHIN these sections is unimportant. The order OF these -sections IS important. The MODEM parameters are set with the first -recognized KISS parameter... - -Please note that you can initialize the board only once after boot -(or insmod). You can change all parameters but "mode" and "clock" -later with the Sccparam program or through KISS. Just to avoid -security holes... - -(1) this divider is usually mounted on the SCC-PBC (PA0HZP) or not - present at all (BayCom). It feeds back the output of the DPLL - (digital pll) as transmit clock. Using this mode without a divider - installed will normally result in keying the transceiver until - maxkey expires --- of course without sending anything (useful). - -2. Attachment of a channel by your AX.25 software -================================================= - -2.1 Kernel AX.25 -================ - -To set up an AX.25 device you can simply type: - - ifconfig scc0 44.128.1.1 hw ax25 dl0tha-7 - -This will create a network interface with the IP number 44.128.20.107 -and the callsign "dl0tha". If you do not have any IP number (yet) you -can use any of the 44.128.0.0 network. Note that you do not need -axattach. The purpose of axattach (like slattach) is to create a KISS -network device linked to a TTY. Please read the documentation of the -ax25-utils and the AX.25-HOWTO to learn how to set the parameters of -the kernel AX.25. - -2.2 NOS, NET and TFKISS -======================= - -Since the TTY driver (aka KISS TNC emulation) is gone you need -to emulate the old behaviour. The cost of using these programs is -that you probably need to compile the kernel AX.25, regardless of whether -you actually use it or not. First setup your /etc/ax25/axports, -for example: - - 9k6 dl0tha-9 9600 255 4 9600 baud port (scc3) - axlink dl0tha-15 38400 255 4 Link to NOS - -Now "ifconfig" the scc device: - - ifconfig scc3 44.128.1.1 hw ax25 dl0tha-9 - -You can now axattach a pseudo-TTY: - - axattach /dev/ptys0 axlink - -and start your NOS and attach /dev/ptys0 there. The problem is that -NOS is reachable only via digipeating through the kernel AX.25 -(disastrous on a DAMA controlled channel). To solve this problem, -configure "rxecho" to echo the incoming frames from "9k6" to "axlink" -and outgoing frames from "axlink" to "9k6" and start: - - rxecho - -Or simply use "kissbridge" coming with z8530drv-utils: - - ifconfig scc3 hw ax25 dl0tha-9 - kissbridge scc3 /dev/ptys0 - - -3. Adjustment and Display of parameters -======================================= - -3.1 Displaying SCC Parameters: -============================== - -Once a SCC channel has been attached, the parameter settings and -some statistic information can be shown using the param program: - -dl1bke-u:~$ sccstat scc0 - -Parameters: - -speed : 1200 baud -txdelay : 36 -persist : 255 -slottime : 0 -txtail : 8 -fulldup : 1 -waittime : 12 -mintime : 3 sec -maxkeyup : 7 sec -idletime : 3 sec -maxdefer : 120 sec -group : 0x00 -txoff : off -softdcd : on -SLIP : off - -Status: - -HDLC Z8530 Interrupts Buffers ------------------------------------------------------------------------ -Sent : 273 RxOver : 0 RxInts : 125074 Size : 384 -Received : 1095 TxUnder: 0 TxInts : 4684 NoSpace : 0 -RxErrors : 1591 ExInts : 11776 -TxErrors : 0 SpInts : 1503 -Tx State : idle - - -The status info shown is: - -Sent - number of frames transmitted -Received - number of frames received -RxErrors - number of receive errors (CRC, ABORT) -TxErrors - number of discarded Tx frames (due to various reasons) -Tx State - status of the Tx interrupt handler: idle/busy/active/tail (2) -RxOver - number of receiver overruns -TxUnder - number of transmitter underruns -RxInts - number of receiver interrupts -TxInts - number of transmitter interrupts -EpInts - number of receiver special condition interrupts -SpInts - number of external/status interrupts -Size - maximum size of an AX.25 frame (*with* AX.25 headers!) -NoSpace - number of times a buffer could not get allocated - -An overrun is abnormal. If lots of these occur, the product of -baudrate and number of interfaces is too high for the processing -power of your computer. NoSpace errors are unlikely to be caused by the -driver or the kernel AX.25. - - -3.2 Setting Parameters -====================== - - -The setting of parameters of the emulated KISS TNC is done in the -same way in the SCC driver. You can change parameters by using -the kissparms program from the ax25-utils package or use the program -"sccparam": - - sccparam - -You can change the following parameters: - -param : value ------------------------- -speed : 1200 -txdelay : 36 -persist : 255 -slottime : 0 -txtail : 8 -fulldup : 1 -waittime : 12 -mintime : 3 -maxkeyup : 7 -idletime : 3 -maxdefer : 120 -group : 0x00 -txoff : off -softdcd : on -SLIP : off - - -The parameters have the following meaning: - -speed: - The baudrate on this channel in bits/sec - - Example: sccparam /dev/scc3 speed 9600 - -txdelay: - The delay (in units of 10 ms) after keying of the - transmitter, until the first byte is sent. This is usually - called "TXDELAY" in a TNC. When 0 is specified, the driver - will just wait until the CTS signal is asserted. This - assumes the presence of a timer or other circuitry in the - MODEM and/or transmitter, that asserts CTS when the - transmitter is ready for data. - A normal value of this parameter is 30-36. - - Example: sccparam /dev/scc0 txd 20 - -persist: - This is the probability that the transmitter will be keyed - when the channel is found to be free. It is a value from 0 - to 255, and the probability is (value+1)/256. The value - should be somewhere near 50-60, and should be lowered when - the channel is used more heavily. - - Example: sccparam /dev/scc2 persist 20 - -slottime: - This is the time between samples of the channel. It is - expressed in units of 10 ms. About 200-300 ms (value 20-30) - seems to be a good value. - - Example: sccparam /dev/scc0 slot 20 - -tail: - The time the transmitter will remain keyed after the last - byte of a packet has been transferred to the SCC. This is - necessary because the CRC and a flag still have to leave the - SCC before the transmitter is keyed down. The value depends - on the baudrate selected. A few character times should be - sufficient, e.g. 40ms at 1200 baud. (value 4) - The value of this parameter is in 10 ms units. - - Example: sccparam /dev/scc2 4 - -full: - The full-duplex mode switch. This can be one of the following - values: - - 0: The interface will operate in CSMA mode (the normal - half-duplex packet radio operation) - 1: Fullduplex mode, i.e. the transmitter will be keyed at - any time, without checking the received carrier. It - will be unkeyed when there are no packets to be sent. - 2: Like 1, but the transmitter will remain keyed, also - when there are no packets to be sent. Flags will be - sent in that case, until a timeout (parameter 10) - occurs. - - Example: sccparam /dev/scc0 fulldup off - -wait: - The initial waittime before any transmit attempt, after the - frame has been queue for transmit. This is the length of - the first slot in CSMA mode. In full duplex modes it is - set to 0 for maximum performance. - The value of this parameter is in 10 ms units. - - Example: sccparam /dev/scc1 wait 4 - -maxkey: - The maximal time the transmitter will be keyed to send - packets, in seconds. This can be useful on busy CSMA - channels, to avoid "getting a bad reputation" when you are - generating a lot of traffic. After the specified time has - elapsed, no new frame will be started. Instead, the trans- - mitter will be switched off for a specified time (parameter - min), and then the selected algorithm for keyup will be - started again. - The value 0 as well as "off" will disable this feature, - and allow infinite transmission time. - - Example: sccparam /dev/scc0 maxk 20 - -min: - This is the time the transmitter will be switched off when - the maximum transmission time is exceeded. - - Example: sccparam /dev/scc3 min 10 - -idle - This parameter specifies the maximum idle time in full duplex - 2 mode, in seconds. When no frames have been sent for this - time, the transmitter will be keyed down. A value of 0 is - has same result as the fullduplex mode 1. This parameter - can be disabled. - - Example: sccparam /dev/scc2 idle off # transmit forever - -maxdefer - This is the maximum time (in seconds) to wait for a free channel - to send. When this timer expires the transmitter will be keyed - IMMEDIATELY. If you love to get trouble with other users you - should set this to a very low value ;-) - - Example: sccparam /dev/scc0 maxdefer 240 # 2 minutes - - -txoff: - When this parameter has the value 0, the transmission of packets - is enable. Otherwise it is disabled. - - Example: sccparam /dev/scc2 txoff on - -group: - It is possible to build special radio equipment to use more than - one frequency on the same band, e.g. using several receivers and - only one transmitter that can be switched between frequencies. - Also, you can connect several radios that are active on the same - band. In these cases, it is not possible, or not a good idea, to - transmit on more than one frequency. The SCC driver provides a - method to lock transmitters on different interfaces, using the - "param group " command. This will only work when - you are using CSMA mode (parameter full = 0). - The number must be 0 if you want no group restrictions, and - can be computed as follows to create restricted groups: - is the sum of some OCTAL numbers: - - 200 This transmitter will only be keyed when all other - transmitters in the group are off. - 100 This transmitter will only be keyed when the carrier - detect of all other interfaces in the group is off. - 0xx A byte that can be used to define different groups. - Interfaces are in the same group, when the logical AND - between their xx values is nonzero. - - Examples: - When 2 interfaces use group 201, their transmitters will never be - keyed at the same time. - When 2 interfaces use group 101, the transmitters will only key - when both channels are clear at the same time. When group 301, - the transmitters will not be keyed at the same time. - - Don't forget to convert the octal numbers into decimal before - you set the parameter. - - Example: (to be written) - -softdcd: - use a software dcd instead of the real one... Useful for a very - slow squelch. - - Example: sccparam /dev/scc0 soft on - - -4. Problems -=========== - -If you have tx-problems with your BayCom USCC card please check -the manufacturer of the 8530. SGS chips have a slightly -different timing. Try Zilog... A solution is to write to register 8 -instead to the data port, but this won't work with the ESCC chips. -*SIGH!* - -A very common problem is that the PTT locks until the maxkeyup timer -expires, although interrupts and clock source are correct. In most -cases compiling the driver with CONFIG_SCC_DELAY (set with -make config) solves the problems. For more hints read the (pseudo) FAQ -and the documentation coming with z8530drv-utils. - -I got reports that the driver has problems on some 386-based systems. -(i.e. Amstrad) Those systems have a bogus AT bus timing which will -lead to delayed answers on interrupts. You can recognize these -problems by looking at the output of Sccstat for the suspected -port. If it shows under- and overruns you own such a system. - -Delayed processing of received data: This depends on - -- the kernel version - -- kernel profiling compiled or not - -- a high interrupt load - -- a high load of the machine --- running X, Xmorph, XV and Povray, - while compiling the kernel... hmm ... even with 32 MB RAM ... ;-) - Or running a named for the whole .ampr.org domain on an 8 MB - box... - -- using information from rxecho or kissbridge. - -Kernel panics: please read /linux/README and find out if it -really occurred within the scc driver. - -If you cannot solve a problem, send me - -- a description of the problem, -- information on your hardware (computer system, scc board, modem) -- your kernel version -- the output of cat /proc/net/z8530 - -4. Thor RLC100 -============== - -Mysteriously this board seems not to work with the driver. Anyone -got it up-and-running? - - -Many thanks to Linus Torvalds and Alan Cox for including the driver -in the Linux standard distribution and their support. - -Joerg Reuter ampr-net: dl1bke@db0pra.ampr.org - AX-25 : DL1BKE @ DB0ABH.#BAY.DEU.EU - Internet: jreuter@yaina.de - WWW : http://yaina.de/jreuter diff --git a/MAINTAINERS b/MAINTAINERS index d59455c27c42..bee65ebdc67e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18644,7 +18644,7 @@ L: linux-hams@vger.kernel.org S: Maintained W: http://yaina.de/jreuter/ W: http://www.qsl.net/dl1bke/ -F: Documentation/networking/z8530drv.txt +F: Documentation/networking/z8530drv.rst F: drivers/net/hamradio/*scc.c F: drivers/net/hamradio/z8530.h diff --git a/drivers/net/hamradio/Kconfig b/drivers/net/hamradio/Kconfig index fe409819b56d..f4500f04147d 100644 --- a/drivers/net/hamradio/Kconfig +++ b/drivers/net/hamradio/Kconfig @@ -84,7 +84,7 @@ config SCC ---help--- These cards are used to connect your Linux box to an amateur radio in order to communicate with other computers. If you want to use - this, read and the + this, read and the AX25-HOWTO, available from . Also make sure to say Y to "Amateur Radio AX.25 Level 2" support. @@ -98,7 +98,7 @@ config SCC_DELAY help Say Y here if you experience problems with the SCC driver not working properly; please read - for details. + for details. If unsure, say N. diff --git a/drivers/net/hamradio/scc.c b/drivers/net/hamradio/scc.c index 6c03932d8a6b..33fdd55c6122 100644 --- a/drivers/net/hamradio/scc.c +++ b/drivers/net/hamradio/scc.c @@ -7,7 +7,7 @@ * ------------------ * * You can find a subset of the documentation in - * Documentation/networking/z8530drv.txt. + * Documentation/networking/z8530drv.rst. */ /* -- cgit v1.2.3 From c79773e83e66fb2c22627eda3cd768f9e2bc10b5 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:34 +0200 Subject: docs: networking: device drivers: convert 3com/3c509.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - add notes markups; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/3com/3c509.rst | 249 +++++++++++++++++++++ .../networking/device_drivers/3com/3c509.txt | 213 ------------------ Documentation/networking/device_drivers/index.rst | 1 + 3 files changed, 250 insertions(+), 213 deletions(-) create mode 100644 Documentation/networking/device_drivers/3com/3c509.rst delete mode 100644 Documentation/networking/device_drivers/3com/3c509.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/3com/3c509.rst b/Documentation/networking/device_drivers/3com/3c509.rst new file mode 100644 index 000000000000..47f706bacdd9 --- /dev/null +++ b/Documentation/networking/device_drivers/3com/3c509.rst @@ -0,0 +1,249 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================================= +Linux and the 3Com EtherLink III Series Ethercards (driver v1.18c and higher) +============================================================================= + +This file contains the instructions and caveats for v1.18c and higher versions +of the 3c509 driver. You should not use the driver without reading this file. + +release 1.0 + +28 February 2002 + +Current maintainer (corrections to): + David Ruggiero + +Introduction +============ + +The following are notes and information on using the 3Com EtherLink III series +ethercards in Linux. These cards are commonly known by the most widely-used +card's 3Com model number, 3c509. They are all 10mb/s ISA-bus cards and shouldn't +be (but sometimes are) confused with the similarly-numbered PCI-bus "3c905" +(aka "Vortex" or "Boomerang") series. Kernel support for the 3c509 family is +provided by the module 3c509.c, which has code to support all of the following +models: + + - 3c509 (original ISA card) + - 3c509B (later revision of the ISA card; supports full-duplex) + - 3c589 (PCMCIA) + - 3c589B (later revision of the 3c589; supports full-duplex) + - 3c579 (EISA) + +Large portions of this documentation were heavily borrowed from the guide +written the original author of the 3c509 driver, Donald Becker. The master +copy of that document, which contains notes on older versions of the driver, +currently resides on Scyld web server: http://www.scyld.com/. + + +Special Driver Features +======================= + +Overriding card settings + +The driver allows boot- or load-time overriding of the card's detected IOADDR, +IRQ, and transceiver settings, although this capability shouldn't generally be +needed except to enable full-duplex mode (see below). An example of the syntax +for LILO parameters for doing this:: + + ether=10,0x310,3,0x3c509,eth0 + +This configures the first found 3c509 card for IRQ 10, base I/O 0x310, and +transceiver type 3 (10base2). The flag "0x3c509" must be set to avoid conflicts +with other card types when overriding the I/O address. When the driver is +loaded as a module, only the IRQ may be overridden. For example, +setting two cards to IRQ10 and IRQ11 is done by using the irq module +option:: + + options 3c509 irq=10,11 + + +Full-duplex mode +================ + +The v1.18c driver added support for the 3c509B's full-duplex capabilities. +In order to enable and successfully use full-duplex mode, three conditions +must be met: + +(a) You must have a Etherlink III card model whose hardware supports full- +duplex operations. Currently, the only members of the 3c509 family that are +positively known to support full-duplex are the 3c509B (ISA bus) and 3c589B +(PCMCIA) cards. Cards without the "B" model designation do *not* support +full-duplex mode; these include the original 3c509 (no "B"), the original +3c589, the 3c529 (MCA bus), and the 3c579 (EISA bus). + +(b) You must be using your card's 10baseT transceiver (i.e., the RJ-45 +connector), not its AUI (thick-net) or 10base2 (thin-net/coax) interfaces. +AUI and 10base2 network cabling is physically incapable of full-duplex +operation. + +(c) Most importantly, your 3c509B must be connected to a link partner that is +itself full-duplex capable. This is almost certainly one of two things: a full- +duplex-capable Ethernet switch (*not* a hub), or a full-duplex-capable NIC on +another system that's connected directly to the 3c509B via a crossover cable. + +Full-duplex mode can be enabled using 'ethtool'. + +.. warning:: + + Extremely important caution concerning full-duplex mode + + Understand that the 3c509B's hardware's full-duplex support is much more + limited than that provide by more modern network interface cards. Although + at the physical layer of the network it fully supports full-duplex operation, + the card was designed before the current Ethernet auto-negotiation (N-way) + spec was written. This means that the 3c509B family ***cannot and will not + auto-negotiate a full-duplex connection with its link partner under any + circumstances, no matter how it is initialized***. If the full-duplex mode + of the 3c509B is enabled, its link partner will very likely need to be + independently _forced_ into full-duplex mode as well; otherwise various nasty + failures will occur - at the very least, you'll see massive numbers of packet + collisions. This is one of very rare circumstances where disabling auto- + negotiation and forcing the duplex mode of a network interface card or switch + would ever be necessary or desirable. + + +Available Transceiver Types +=========================== + +For versions of the driver v1.18c and above, the available transceiver types are: + +== ========================================================================= +0 transceiver type from EEPROM config (normally 10baseT); force half-duplex +1 AUI (thick-net / DB15 connector) +2 (undefined) +3 10base2 (thin-net == coax / BNC connector) +4 10baseT (RJ-45 connector); force half-duplex mode +8 transceiver type and duplex mode taken from card's EEPROM config settings +12 10baseT (RJ-45 connector); force full-duplex mode +== ========================================================================= + +Prior to driver version 1.18c, only transceiver codes 0-4 were supported. Note +that the new transceiver codes 8 and 12 are the *only* ones that will enable +full-duplex mode, no matter what the card's detected EEPROM settings might be. +This insured that merely upgrading the driver from an earlier version would +never automatically enable full-duplex mode in an existing installation; +it must always be explicitly enabled via one of these code in order to be +activated. + +The transceiver type can be changed using 'ethtool'. + + +Interpretation of error messages and common problems +---------------------------------------------------- + +Error Messages +^^^^^^^^^^^^^^ + +eth0: Infinite loop in interrupt, status 2011. +These are "mostly harmless" message indicating that the driver had too much +work during that interrupt cycle. With a status of 0x2011 you are receiving +packets faster than they can be removed from the card. This should be rare +or impossible in normal operation. Possible causes of this error report are: + + - a "green" mode enabled that slows the processor down when there is no + keyboard activity. + + - some other device or device driver hogging the bus or disabling interrupts. + Check /proc/interrupts for excessive interrupt counts. The timer tick + interrupt should always be incrementing faster than the others. + +No received packets +^^^^^^^^^^^^^^^^^^^ + +If a 3c509, 3c562 or 3c589 can successfully transmit packets, but never +receives packets (as reported by /proc/net/dev or 'ifconfig') you likely +have an interrupt line problem. Check /proc/interrupts to verify that the +card is actually generating interrupts. If the interrupt count is not +increasing you likely have a physical conflict with two devices trying to +use the same ISA IRQ line. The common conflict is with a sound card on IRQ10 +or IRQ5, and the easiest solution is to move the 3c509 to a different +interrupt line. If the device is receiving packets but 'ping' doesn't work, +you have a routing problem. + +Tx Carrier Errors Reported in /proc/net/dev +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + +If an EtherLink III appears to transmit packets, but the "Tx carrier errors" +field in /proc/net/dev increments as quickly as the Tx packet count, you +likely have an unterminated network or the incorrect media transceiver selected. + +3c509B card is not detected on machines with an ISA PnP BIOS. +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +While the updated driver works with most PnP BIOS programs, it does not work +with all. This can be fixed by disabling PnP support using the 3Com-supplied +setup program. + +3c509 card is not detected on overclocked machines +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Increase the delay time in id_read_eeprom() from the current value, 500, +to an absurdly high value, such as 5000. + + +Decoding Status and Error Messages +---------------------------------- + + +The bits in the main status register are: + +===== ====================================== +value description +===== ====================================== +0x01 Interrupt latch +0x02 Tx overrun, or Rx underrun +0x04 Tx complete +0x08 Tx FIFO room available +0x10 A complete Rx packet has arrived +0x20 A Rx packet has started to arrive +0x40 The driver has requested an interrupt +0x80 Statistics counter nearly full +===== ====================================== + +The bits in the transmit (Tx) status word are: + +===== ============================================ +value description +===== ============================================ +0x02 Out-of-window collision. +0x04 Status stack overflow (normally impossible). +0x08 16 collisions. +0x10 Tx underrun (not enough PCI bus bandwidth). +0x20 Tx jabber. +0x40 Tx interrupt requested. +0x80 Status is valid (this should always be set). +===== ============================================ + + +When a transmit error occurs the driver produces a status message such as:: + + eth0: Transmit error, Tx status register 82 + +The two values typically seen here are: + +0x82 +^^^^ + +Out of window collision. This typically occurs when some other Ethernet +host is incorrectly set to full duplex on a half duplex network. + +0x88 +^^^^ + +16 collisions. This typically occurs when the network is exceptionally busy +or when another host doesn't correctly back off after a collision. If this +error is mixed with 0x82 errors it is the result of a host incorrectly set +to full duplex (see above). + +Both of these errors are the result of network problems that should be +corrected. They do not represent driver malfunction. + + +Revision history (this file) +============================ + +28Feb02 v1.0 DR New; major portions based on Becker original 3c509 docs + diff --git a/Documentation/networking/device_drivers/3com/3c509.txt b/Documentation/networking/device_drivers/3com/3c509.txt deleted file mode 100644 index fbf722e15ac3..000000000000 --- a/Documentation/networking/device_drivers/3com/3c509.txt +++ /dev/null @@ -1,213 +0,0 @@ -Linux and the 3Com EtherLink III Series Ethercards (driver v1.18c and higher) ----------------------------------------------------------------------------- - -This file contains the instructions and caveats for v1.18c and higher versions -of the 3c509 driver. You should not use the driver without reading this file. - -release 1.0 -28 February 2002 -Current maintainer (corrections to): - David Ruggiero - ----------------------------------------------------------------------------- - -(0) Introduction - -The following are notes and information on using the 3Com EtherLink III series -ethercards in Linux. These cards are commonly known by the most widely-used -card's 3Com model number, 3c509. They are all 10mb/s ISA-bus cards and shouldn't -be (but sometimes are) confused with the similarly-numbered PCI-bus "3c905" -(aka "Vortex" or "Boomerang") series. Kernel support for the 3c509 family is -provided by the module 3c509.c, which has code to support all of the following -models: - - 3c509 (original ISA card) - 3c509B (later revision of the ISA card; supports full-duplex) - 3c589 (PCMCIA) - 3c589B (later revision of the 3c589; supports full-duplex) - 3c579 (EISA) - -Large portions of this documentation were heavily borrowed from the guide -written the original author of the 3c509 driver, Donald Becker. The master -copy of that document, which contains notes on older versions of the driver, -currently resides on Scyld web server: http://www.scyld.com/. - - -(1) Special Driver Features - -Overriding card settings - -The driver allows boot- or load-time overriding of the card's detected IOADDR, -IRQ, and transceiver settings, although this capability shouldn't generally be -needed except to enable full-duplex mode (see below). An example of the syntax -for LILO parameters for doing this: - - ether=10,0x310,3,0x3c509,eth0 - -This configures the first found 3c509 card for IRQ 10, base I/O 0x310, and -transceiver type 3 (10base2). The flag "0x3c509" must be set to avoid conflicts -with other card types when overriding the I/O address. When the driver is -loaded as a module, only the IRQ may be overridden. For example, -setting two cards to IRQ10 and IRQ11 is done by using the irq module -option: - - options 3c509 irq=10,11 - - -(2) Full-duplex mode - -The v1.18c driver added support for the 3c509B's full-duplex capabilities. -In order to enable and successfully use full-duplex mode, three conditions -must be met: - -(a) You must have a Etherlink III card model whose hardware supports full- -duplex operations. Currently, the only members of the 3c509 family that are -positively known to support full-duplex are the 3c509B (ISA bus) and 3c589B -(PCMCIA) cards. Cards without the "B" model designation do *not* support -full-duplex mode; these include the original 3c509 (no "B"), the original -3c589, the 3c529 (MCA bus), and the 3c579 (EISA bus). - -(b) You must be using your card's 10baseT transceiver (i.e., the RJ-45 -connector), not its AUI (thick-net) or 10base2 (thin-net/coax) interfaces. -AUI and 10base2 network cabling is physically incapable of full-duplex -operation. - -(c) Most importantly, your 3c509B must be connected to a link partner that is -itself full-duplex capable. This is almost certainly one of two things: a full- -duplex-capable Ethernet switch (*not* a hub), or a full-duplex-capable NIC on -another system that's connected directly to the 3c509B via a crossover cable. - -Full-duplex mode can be enabled using 'ethtool'. - -/////Extremely important caution concerning full-duplex mode///// -Understand that the 3c509B's hardware's full-duplex support is much more -limited than that provide by more modern network interface cards. Although -at the physical layer of the network it fully supports full-duplex operation, -the card was designed before the current Ethernet auto-negotiation (N-way) -spec was written. This means that the 3c509B family ***cannot and will not -auto-negotiate a full-duplex connection with its link partner under any -circumstances, no matter how it is initialized***. If the full-duplex mode -of the 3c509B is enabled, its link partner will very likely need to be -independently _forced_ into full-duplex mode as well; otherwise various nasty -failures will occur - at the very least, you'll see massive numbers of packet -collisions. This is one of very rare circumstances where disabling auto- -negotiation and forcing the duplex mode of a network interface card or switch -would ever be necessary or desirable. - - -(3) Available Transceiver Types - -For versions of the driver v1.18c and above, the available transceiver types are: - -0 transceiver type from EEPROM config (normally 10baseT); force half-duplex -1 AUI (thick-net / DB15 connector) -2 (undefined) -3 10base2 (thin-net == coax / BNC connector) -4 10baseT (RJ-45 connector); force half-duplex mode -8 transceiver type and duplex mode taken from card's EEPROM config settings -12 10baseT (RJ-45 connector); force full-duplex mode - -Prior to driver version 1.18c, only transceiver codes 0-4 were supported. Note -that the new transceiver codes 8 and 12 are the *only* ones that will enable -full-duplex mode, no matter what the card's detected EEPROM settings might be. -This insured that merely upgrading the driver from an earlier version would -never automatically enable full-duplex mode in an existing installation; -it must always be explicitly enabled via one of these code in order to be -activated. - -The transceiver type can be changed using 'ethtool'. - - -(4a) Interpretation of error messages and common problems - -Error Messages - -eth0: Infinite loop in interrupt, status 2011. -These are "mostly harmless" message indicating that the driver had too much -work during that interrupt cycle. With a status of 0x2011 you are receiving -packets faster than they can be removed from the card. This should be rare -or impossible in normal operation. Possible causes of this error report are: - - - a "green" mode enabled that slows the processor down when there is no - keyboard activity. - - - some other device or device driver hogging the bus or disabling interrupts. - Check /proc/interrupts for excessive interrupt counts. The timer tick - interrupt should always be incrementing faster than the others. - -No received packets -If a 3c509, 3c562 or 3c589 can successfully transmit packets, but never -receives packets (as reported by /proc/net/dev or 'ifconfig') you likely -have an interrupt line problem. Check /proc/interrupts to verify that the -card is actually generating interrupts. If the interrupt count is not -increasing you likely have a physical conflict with two devices trying to -use the same ISA IRQ line. The common conflict is with a sound card on IRQ10 -or IRQ5, and the easiest solution is to move the 3c509 to a different -interrupt line. If the device is receiving packets but 'ping' doesn't work, -you have a routing problem. - -Tx Carrier Errors Reported in /proc/net/dev -If an EtherLink III appears to transmit packets, but the "Tx carrier errors" -field in /proc/net/dev increments as quickly as the Tx packet count, you -likely have an unterminated network or the incorrect media transceiver selected. - -3c509B card is not detected on machines with an ISA PnP BIOS. -While the updated driver works with most PnP BIOS programs, it does not work -with all. This can be fixed by disabling PnP support using the 3Com-supplied -setup program. - -3c509 card is not detected on overclocked machines -Increase the delay time in id_read_eeprom() from the current value, 500, -to an absurdly high value, such as 5000. - - -(4b) Decoding Status and Error Messages - -The bits in the main status register are: - -value description -0x01 Interrupt latch -0x02 Tx overrun, or Rx underrun -0x04 Tx complete -0x08 Tx FIFO room available -0x10 A complete Rx packet has arrived -0x20 A Rx packet has started to arrive -0x40 The driver has requested an interrupt -0x80 Statistics counter nearly full - -The bits in the transmit (Tx) status word are: - -value description -0x02 Out-of-window collision. -0x04 Status stack overflow (normally impossible). -0x08 16 collisions. -0x10 Tx underrun (not enough PCI bus bandwidth). -0x20 Tx jabber. -0x40 Tx interrupt requested. -0x80 Status is valid (this should always be set). - - -When a transmit error occurs the driver produces a status message such as - - eth0: Transmit error, Tx status register 82 - -The two values typically seen here are: - -0x82 -Out of window collision. This typically occurs when some other Ethernet -host is incorrectly set to full duplex on a half duplex network. - -0x88 -16 collisions. This typically occurs when the network is exceptionally busy -or when another host doesn't correctly back off after a collision. If this -error is mixed with 0x82 errors it is the result of a host incorrectly set -to full duplex (see above). - -Both of these errors are the result of network problems that should be -corrected. They do not represent driver malfunction. - - -(5) Revision history (this file) - -28Feb02 v1.0 DR New; major portions based on Becker original 3c509 docs - diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index a191faaf97de..402a9188f446 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -27,6 +27,7 @@ Contents: netronome/nfp pensando/ionic stmicro/stmmac + 3com/3c509 .. only:: subproject and html -- cgit v1.2.3 From 9ea2af8d16f5612168ed52cb0ec6752bac0877a9 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:35 +0200 Subject: docs: networking: device drivers: convert 3com/vortex.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/3com/vortex.rst | 461 +++++++++++++++++++++ .../networking/device_drivers/3com/vortex.txt | 448 -------------------- Documentation/networking/device_drivers/index.rst | 1 + MAINTAINERS | 2 +- drivers/net/ethernet/3com/3c59x.c | 4 +- drivers/net/ethernet/3com/Kconfig | 2 +- 6 files changed, 466 insertions(+), 452 deletions(-) create mode 100644 Documentation/networking/device_drivers/3com/vortex.rst delete mode 100644 Documentation/networking/device_drivers/3com/vortex.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/3com/vortex.rst b/Documentation/networking/device_drivers/3com/vortex.rst new file mode 100644 index 000000000000..800add5be338 --- /dev/null +++ b/Documentation/networking/device_drivers/3com/vortex.rst @@ -0,0 +1,461 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +3Com Vortex device driver +========================= + +Documentation/networking/device_drivers/3com/vortex.rst + +Andrew Morton + +30 April 2000 + + +This document describes the usage and errata of the 3Com "Vortex" device +driver for Linux, 3c59x.c. + +The driver was written by Donald Becker + +Don is no longer the prime maintainer of this version of the driver. +Please report problems to one or more of: + +- Andrew Morton +- Netdev mailing list +- Linux kernel mailing list + +Please note the 'Reporting and Diagnosing Problems' section at the end +of this file. + + +Since kernel 2.3.99-pre6, this driver incorporates the support for the +3c575-series Cardbus cards which used to be handled by 3c575_cb.c. + +This driver supports the following hardware: + + - 3c590 Vortex 10Mbps + - 3c592 EISA 10Mbps Demon/Vortex + - 3c597 EISA Fast Demon/Vortex + - 3c595 Vortex 100baseTx + - 3c595 Vortex 100baseT4 + - 3c595 Vortex 100base-MII + - 3c900 Boomerang 10baseT + - 3c900 Boomerang 10Mbps Combo + - 3c900 Cyclone 10Mbps TPO + - 3c900 Cyclone 10Mbps Combo + - 3c900 Cyclone 10Mbps TPC + - 3c900B-FL Cyclone 10base-FL + - 3c905 Boomerang 100baseTx + - 3c905 Boomerang 100baseT4 + - 3c905B Cyclone 100baseTx + - 3c905B Cyclone 10/100/BNC + - 3c905B-FX Cyclone 100baseFx + - 3c905C Tornado + - 3c920B-EMB-WNM (ATI Radeon 9100 IGP) + - 3c980 Cyclone + - 3c980C Python-T + - 3cSOHO100-TX Hurricane + - 3c555 Laptop Hurricane + - 3c556 Laptop Tornado + - 3c556B Laptop Hurricane + - 3c575 [Megahertz] 10/100 LAN CardBus + - 3c575 Boomerang CardBus + - 3CCFE575BT Cyclone CardBus + - 3CCFE575CT Tornado CardBus + - 3CCFE656 Cyclone CardBus + - 3CCFEM656B Cyclone+Winmodem CardBus + - 3CXFEM656C Tornado+Winmodem CardBus + - 3c450 HomePNA Tornado + - 3c920 Tornado + - 3c982 Hydra Dual Port A + - 3c982 Hydra Dual Port B + - 3c905B-T4 + - 3c920B-EMB-WNM Tornado + +Module parameters +================= + +There are several parameters which may be provided to the driver when +its module is loaded. These are usually placed in ``/etc/modprobe.d/*.conf`` +configuration files. Example:: + + options 3c59x debug=3 rx_copybreak=300 + +If you are using the PCMCIA tools (cardmgr) then the options may be +placed in /etc/pcmcia/config.opts:: + + module "3c59x" opts "debug=3 rx_copybreak=300" + + +The supported parameters are: + +debug=N + + Where N is a number from 0 to 7. Anything above 3 produces a lot + of output in your system logs. debug=1 is default. + +options=N1,N2,N3,... + + Each number in the list provides an option to the corresponding + network card. So if you have two 3c905's and you wish to provide + them with option 0x204 you would use:: + + options=0x204,0x204 + + The individual options are composed of a number of bitfields which + have the following meanings: + + Possible media type settings + + == ================================= + 0 10baseT + 1 10Mbs AUI + 2 undefined + 3 10base2 (BNC) + 4 100base-TX + 5 100base-FX + 6 MII (Media Independent Interface) + 7 Use default setting from EEPROM + 8 Autonegotiate + 9 External MII + 10 Use default setting from EEPROM + == ================================= + + When generating a value for the 'options' setting, the above media + selection values may be OR'ed (or added to) the following: + + ====== ============================================= + 0x8000 Set driver debugging level to 7 + 0x4000 Set driver debugging level to 2 + 0x0400 Enable Wake-on-LAN + 0x0200 Force full duplex mode. + 0x0010 Bus-master enable bit (Old Vortex cards only) + ====== ============================================= + + For example:: + + insmod 3c59x options=0x204 + + will force full-duplex 100base-TX, rather than allowing the usual + autonegotiation. + +global_options=N + + Sets the ``options`` parameter for all 3c59x NICs in the machine. + Entries in the ``options`` array above will override any setting of + this. + +full_duplex=N1,N2,N3... + + Similar to bit 9 of 'options'. Forces the corresponding card into + full-duplex mode. Please use this in preference to the ``options`` + parameter. + + In fact, please don't use this at all! You're better off getting + autonegotiation working properly. + +global_full_duplex=N1 + + Sets full duplex mode for all 3c59x NICs in the machine. Entries + in the ``full_duplex`` array above will override any setting of this. + +flow_ctrl=N1,N2,N3... + + Use 802.3x MAC-layer flow control. The 3com cards only support the + PAUSE command, which means that they will stop sending packets for a + short period if they receive a PAUSE frame from the link partner. + + The driver only allows flow control on a link which is operating in + full duplex mode. + + This feature does not appear to work on the 3c905 - only 3c905B and + 3c905C have been tested. + + The 3com cards appear to only respond to PAUSE frames which are + sent to the reserved destination address of 01:80:c2:00:00:01. They + do not honour PAUSE frames which are sent to the station MAC address. + +rx_copybreak=M + + The driver preallocates 32 full-sized (1536 byte) network buffers + for receiving. When a packet arrives, the driver has to decide + whether to leave the packet in its full-sized buffer, or to allocate + a smaller buffer and copy the packet across into it. + + This is a speed/space tradeoff. + + The value of rx_copybreak is used to decide when to make the copy. + If the packet size is less than rx_copybreak, the packet is copied. + The default value for rx_copybreak is 200 bytes. + +max_interrupt_work=N + + The driver's interrupt service routine can handle many receive and + transmit packets in a single invocation. It does this in a loop. + The value of max_interrupt_work governs how many times the interrupt + service routine will loop. The default value is 32 loops. If this + is exceeded the interrupt service routine gives up and generates a + warning message "eth0: Too much work in interrupt". + +hw_checksums=N1,N2,N3,... + + Recent 3com NICs are able to generate IPv4, TCP and UDP checksums + in hardware. Linux has used the Rx checksumming for a long time. + The "zero copy" patch which is planned for the 2.4 kernel series + allows you to make use of the NIC's DMA scatter/gather and transmit + checksumming as well. + + The driver is set up so that, when the zerocopy patch is applied, + all Tornado and Cyclone devices will use S/G and Tx checksums. + + This module parameter has been provided so you can override this + decision. If you think that Tx checksums are causing a problem, you + may disable the feature with ``hw_checksums=0``. + + If you think your NIC should be performing Tx checksumming and the + driver isn't enabling it, you can force the use of hardware Tx + checksumming with ``hw_checksums=1``. + + The driver drops a message in the logfiles to indicate whether or + not it is using hardware scatter/gather and hardware Tx checksums. + + Scatter/gather and hardware checksums provide considerable + performance improvement for the sendfile() system call, but a small + decrease in throughput for send(). There is no effect upon receive + efficiency. + +compaq_ioaddr=N, +compaq_irq=N, +compaq_device_id=N + + "Variables to work-around the Compaq PCI BIOS32 problem".... + +watchdog=N + + Sets the time duration (in milliseconds) after which the kernel + decides that the transmitter has become stuck and needs to be reset. + This is mainly for debugging purposes, although it may be advantageous + to increase this value on LANs which have very high collision rates. + The default value is 5000 (5.0 seconds). + +enable_wol=N1,N2,N3,... + + Enable Wake-on-LAN support for the relevant interface. Donald + Becker's ``ether-wake`` application may be used to wake suspended + machines. + + Also enables the NIC's power management support. + +global_enable_wol=N + + Sets enable_wol mode for all 3c59x NICs in the machine. Entries in + the ``enable_wol`` array above will override any setting of this. + +Media selection +--------------- + +A number of the older NICs such as the 3c590 and 3c900 series have +10base2 and AUI interfaces. + +Prior to January, 2001 this driver would autoeselect the 10base2 or AUI +port if it didn't detect activity on the 10baseT port. It would then +get stuck on the 10base2 port and a driver reload was necessary to +switch back to 10baseT. This behaviour could not be prevented with a +module option override. + +Later (current) versions of the driver _do_ support locking of the +media type. So if you load the driver module with + + modprobe 3c59x options=0 + +it will permanently select the 10baseT port. Automatic selection of +other media types does not occur. + + +Transmit error, Tx status register 82 +------------------------------------- + +This is a common error which is almost always caused by another host on +the same network being in full-duplex mode, while this host is in +half-duplex mode. You need to find that other host and make it run in +half-duplex mode or fix this host to run in full-duplex mode. + +As a last resort, you can force the 3c59x driver into full-duplex mode +with + + options 3c59x full_duplex=1 + +but this has to be viewed as a workaround for broken network gear and +should only really be used for equipment which cannot autonegotiate. + + +Additional resources +-------------------- + +Details of the device driver implementation are at the top of the source file. + +Additional documentation is available at Don Becker's Linux Drivers site: + + http://www.scyld.com/vortex.html + +Donald Becker's driver development site: + + http://www.scyld.com/network.html + +Donald's vortex-diag program is useful for inspecting the NIC's state: + + http://www.scyld.com/ethercard_diag.html + +Donald's mii-diag program may be used for inspecting and manipulating +the NIC's Media Independent Interface subsystem: + + http://www.scyld.com/ethercard_diag.html#mii-diag + +Donald's wake-on-LAN page: + + http://www.scyld.com/wakeonlan.html + +3Com's DOS-based application for setting up the NICs EEPROMs: + + ftp://ftp.3com.com/pub/nic/3c90x/3c90xx2.exe + + +Autonegotiation notes +--------------------- + + The driver uses a one-minute heartbeat for adapting to changes in + the external LAN environment if link is up and 5 seconds if link is down. + This means that when, for example, a machine is unplugged from a hubbed + 10baseT LAN plugged into a switched 100baseT LAN, the throughput + will be quite dreadful for up to sixty seconds. Be patient. + + Cisco interoperability note from Walter Wong : + + On a side note, adding HAS_NWAY seems to share a problem with the + Cisco 6509 switch. Specifically, you need to change the spanning + tree parameter for the port the machine is plugged into to 'portfast' + mode. Otherwise, the negotiation fails. This has been an issue + we've noticed for a while but haven't had the time to track down. + + Cisco switches (Jeff Busch ) + + My "standard config" for ports to which PC's/servers connect directly:: + + interface FastEthernet0/N + description machinename + load-interval 30 + spanning-tree portfast + + If autonegotiation is a problem, you may need to specify "speed + 100" and "duplex full" as well (or "speed 10" and "duplex half"). + + WARNING: DO NOT hook up hubs/switches/bridges to these + specially-configured ports! The switch will become very confused. + + +Reporting and diagnosing problems +--------------------------------- + +Maintainers find that accurate and complete problem reports are +invaluable in resolving driver problems. We are frequently not able to +reproduce problems and must rely on your patience and efforts to get to +the bottom of the problem. + +If you believe you have a driver problem here are some of the +steps you should take: + +- Is it really a driver problem? + + Eliminate some variables: try different cards, different + computers, different cables, different ports on the switch/hub, + different versions of the kernel or of the driver, etc. + +- OK, it's a driver problem. + + You need to generate a report. Typically this is an email to the + maintainer and/or netdev@vger.kernel.org. The maintainer's + email address will be in the driver source or in the MAINTAINERS file. + +- The contents of your report will vary a lot depending upon the + problem. If it's a kernel crash then you should refer to the + admin-guide/reporting-bugs.rst file. + + But for most problems it is useful to provide the following: + + - Kernel version, driver version + + - A copy of the banner message which the driver generates when + it is initialised. For example: + + eth0: 3Com PCI 3c905C Tornado at 0xa400, 00:50:da:6a:88:f0, IRQ 19 + 8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface. + MII transceiver found at address 24, status 782d. + Enabling bus-master transmits and whole-frame receives. + + NOTE: You must provide the ``debug=2`` modprobe option to generate + a full detection message. Please do this:: + + modprobe 3c59x debug=2 + + - If it is a PCI device, the relevant output from 'lspci -vx', eg:: + + 00:09.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74) + Subsystem: 3Com Corporation: Unknown device 9200 + Flags: bus master, medium devsel, latency 32, IRQ 19 + I/O ports at a400 [size=128] + Memory at db000000 (32-bit, non-prefetchable) [size=128] + Expansion ROM at [disabled] [size=128K] + Capabilities: [dc] Power Management version 2 + 00: b7 10 00 92 07 00 10 02 74 00 00 02 08 20 00 00 + 10: 01 a4 00 00 00 00 00 db 00 00 00 00 00 00 00 00 + 20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10 + 30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a + + - A description of the environment: 10baseT? 100baseT? + full/half duplex? switched or hubbed? + + - Any additional module parameters which you may be providing to the driver. + + - Any kernel logs which are produced. The more the merrier. + If this is a large file and you are sending your report to a + mailing list, mention that you have the logfile, but don't send + it. If you're reporting direct to the maintainer then just send + it. + + To ensure that all kernel logs are available, add the + following line to /etc/syslog.conf:: + + kern.* /var/log/messages + + Then restart syslogd with:: + + /etc/rc.d/init.d/syslog restart + + (The above may vary, depending upon which Linux distribution you use). + + - If your problem is reproducible then that's great. Try the + following: + + 1) Increase the debug level. Usually this is done via: + + a) modprobe driver debug=7 + b) In /etc/modprobe.d/driver.conf: + options driver debug=7 + + 2) Recreate the problem with the higher debug level, + send all logs to the maintainer. + + 3) Download you card's diagnostic tool from Donald + Becker's website . + Download mii-diag.c as well. Build these. + + a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is + working correctly. Save the output. + + b) Run the above commands when the card is malfunctioning. Send + both sets of output. + +Finally, please be patient and be prepared to do some work. You may +end up working on this problem for a week or more as the maintainer +asks more questions, asks for more tests, asks for patches to be +applied, etc. At the end of it all, the problem may even remain +unresolved. diff --git a/Documentation/networking/device_drivers/3com/vortex.txt b/Documentation/networking/device_drivers/3com/vortex.txt deleted file mode 100644 index 587f3fcfbcae..000000000000 --- a/Documentation/networking/device_drivers/3com/vortex.txt +++ /dev/null @@ -1,448 +0,0 @@ -Documentation/networking/device_drivers/3com/vortex.txt -Andrew Morton -30 April 2000 - - -This document describes the usage and errata of the 3Com "Vortex" device -driver for Linux, 3c59x.c. - -The driver was written by Donald Becker - -Don is no longer the prime maintainer of this version of the driver. -Please report problems to one or more of: - - Andrew Morton - Netdev mailing list - Linux kernel mailing list - -Please note the 'Reporting and Diagnosing Problems' section at the end -of this file. - - -Since kernel 2.3.99-pre6, this driver incorporates the support for the -3c575-series Cardbus cards which used to be handled by 3c575_cb.c. - -This driver supports the following hardware: - - 3c590 Vortex 10Mbps - 3c592 EISA 10Mbps Demon/Vortex - 3c597 EISA Fast Demon/Vortex - 3c595 Vortex 100baseTx - 3c595 Vortex 100baseT4 - 3c595 Vortex 100base-MII - 3c900 Boomerang 10baseT - 3c900 Boomerang 10Mbps Combo - 3c900 Cyclone 10Mbps TPO - 3c900 Cyclone 10Mbps Combo - 3c900 Cyclone 10Mbps TPC - 3c900B-FL Cyclone 10base-FL - 3c905 Boomerang 100baseTx - 3c905 Boomerang 100baseT4 - 3c905B Cyclone 100baseTx - 3c905B Cyclone 10/100/BNC - 3c905B-FX Cyclone 100baseFx - 3c905C Tornado - 3c920B-EMB-WNM (ATI Radeon 9100 IGP) - 3c980 Cyclone - 3c980C Python-T - 3cSOHO100-TX Hurricane - 3c555 Laptop Hurricane - 3c556 Laptop Tornado - 3c556B Laptop Hurricane - 3c575 [Megahertz] 10/100 LAN CardBus - 3c575 Boomerang CardBus - 3CCFE575BT Cyclone CardBus - 3CCFE575CT Tornado CardBus - 3CCFE656 Cyclone CardBus - 3CCFEM656B Cyclone+Winmodem CardBus - 3CXFEM656C Tornado+Winmodem CardBus - 3c450 HomePNA Tornado - 3c920 Tornado - 3c982 Hydra Dual Port A - 3c982 Hydra Dual Port B - 3c905B-T4 - 3c920B-EMB-WNM Tornado - -Module parameters -================= - -There are several parameters which may be provided to the driver when -its module is loaded. These are usually placed in /etc/modprobe.d/*.conf -configuration files. Example: - -options 3c59x debug=3 rx_copybreak=300 - -If you are using the PCMCIA tools (cardmgr) then the options may be -placed in /etc/pcmcia/config.opts: - -module "3c59x" opts "debug=3 rx_copybreak=300" - - -The supported parameters are: - -debug=N - - Where N is a number from 0 to 7. Anything above 3 produces a lot - of output in your system logs. debug=1 is default. - -options=N1,N2,N3,... - - Each number in the list provides an option to the corresponding - network card. So if you have two 3c905's and you wish to provide - them with option 0x204 you would use: - - options=0x204,0x204 - - The individual options are composed of a number of bitfields which - have the following meanings: - - Possible media type settings - 0 10baseT - 1 10Mbs AUI - 2 undefined - 3 10base2 (BNC) - 4 100base-TX - 5 100base-FX - 6 MII (Media Independent Interface) - 7 Use default setting from EEPROM - 8 Autonegotiate - 9 External MII - 10 Use default setting from EEPROM - - When generating a value for the 'options' setting, the above media - selection values may be OR'ed (or added to) the following: - - 0x8000 Set driver debugging level to 7 - 0x4000 Set driver debugging level to 2 - 0x0400 Enable Wake-on-LAN - 0x0200 Force full duplex mode. - 0x0010 Bus-master enable bit (Old Vortex cards only) - - For example: - - insmod 3c59x options=0x204 - - will force full-duplex 100base-TX, rather than allowing the usual - autonegotiation. - -global_options=N - - Sets the `options' parameter for all 3c59x NICs in the machine. - Entries in the `options' array above will override any setting of - this. - -full_duplex=N1,N2,N3... - - Similar to bit 9 of 'options'. Forces the corresponding card into - full-duplex mode. Please use this in preference to the `options' - parameter. - - In fact, please don't use this at all! You're better off getting - autonegotiation working properly. - -global_full_duplex=N1 - - Sets full duplex mode for all 3c59x NICs in the machine. Entries - in the `full_duplex' array above will override any setting of this. - -flow_ctrl=N1,N2,N3... - - Use 802.3x MAC-layer flow control. The 3com cards only support the - PAUSE command, which means that they will stop sending packets for a - short period if they receive a PAUSE frame from the link partner. - - The driver only allows flow control on a link which is operating in - full duplex mode. - - This feature does not appear to work on the 3c905 - only 3c905B and - 3c905C have been tested. - - The 3com cards appear to only respond to PAUSE frames which are - sent to the reserved destination address of 01:80:c2:00:00:01. They - do not honour PAUSE frames which are sent to the station MAC address. - -rx_copybreak=M - - The driver preallocates 32 full-sized (1536 byte) network buffers - for receiving. When a packet arrives, the driver has to decide - whether to leave the packet in its full-sized buffer, or to allocate - a smaller buffer and copy the packet across into it. - - This is a speed/space tradeoff. - - The value of rx_copybreak is used to decide when to make the copy. - If the packet size is less than rx_copybreak, the packet is copied. - The default value for rx_copybreak is 200 bytes. - -max_interrupt_work=N - - The driver's interrupt service routine can handle many receive and - transmit packets in a single invocation. It does this in a loop. - The value of max_interrupt_work governs how many times the interrupt - service routine will loop. The default value is 32 loops. If this - is exceeded the interrupt service routine gives up and generates a - warning message "eth0: Too much work in interrupt". - -hw_checksums=N1,N2,N3,... - - Recent 3com NICs are able to generate IPv4, TCP and UDP checksums - in hardware. Linux has used the Rx checksumming for a long time. - The "zero copy" patch which is planned for the 2.4 kernel series - allows you to make use of the NIC's DMA scatter/gather and transmit - checksumming as well. - - The driver is set up so that, when the zerocopy patch is applied, - all Tornado and Cyclone devices will use S/G and Tx checksums. - - This module parameter has been provided so you can override this - decision. If you think that Tx checksums are causing a problem, you - may disable the feature with `hw_checksums=0'. - - If you think your NIC should be performing Tx checksumming and the - driver isn't enabling it, you can force the use of hardware Tx - checksumming with `hw_checksums=1'. - - The driver drops a message in the logfiles to indicate whether or - not it is using hardware scatter/gather and hardware Tx checksums. - - Scatter/gather and hardware checksums provide considerable - performance improvement for the sendfile() system call, but a small - decrease in throughput for send(). There is no effect upon receive - efficiency. - -compaq_ioaddr=N -compaq_irq=N -compaq_device_id=N - - "Variables to work-around the Compaq PCI BIOS32 problem".... - -watchdog=N - - Sets the time duration (in milliseconds) after which the kernel - decides that the transmitter has become stuck and needs to be reset. - This is mainly for debugging purposes, although it may be advantageous - to increase this value on LANs which have very high collision rates. - The default value is 5000 (5.0 seconds). - -enable_wol=N1,N2,N3,... - - Enable Wake-on-LAN support for the relevant interface. Donald - Becker's `ether-wake' application may be used to wake suspended - machines. - - Also enables the NIC's power management support. - -global_enable_wol=N - - Sets enable_wol mode for all 3c59x NICs in the machine. Entries in - the `enable_wol' array above will override any setting of this. - -Media selection ---------------- - -A number of the older NICs such as the 3c590 and 3c900 series have -10base2 and AUI interfaces. - -Prior to January, 2001 this driver would autoeselect the 10base2 or AUI -port if it didn't detect activity on the 10baseT port. It would then -get stuck on the 10base2 port and a driver reload was necessary to -switch back to 10baseT. This behaviour could not be prevented with a -module option override. - -Later (current) versions of the driver _do_ support locking of the -media type. So if you load the driver module with - - modprobe 3c59x options=0 - -it will permanently select the 10baseT port. Automatic selection of -other media types does not occur. - - -Transmit error, Tx status register 82 -------------------------------------- - -This is a common error which is almost always caused by another host on -the same network being in full-duplex mode, while this host is in -half-duplex mode. You need to find that other host and make it run in -half-duplex mode or fix this host to run in full-duplex mode. - -As a last resort, you can force the 3c59x driver into full-duplex mode -with - - options 3c59x full_duplex=1 - -but this has to be viewed as a workaround for broken network gear and -should only really be used for equipment which cannot autonegotiate. - - -Additional resources --------------------- - -Details of the device driver implementation are at the top of the source file. - -Additional documentation is available at Don Becker's Linux Drivers site: - - http://www.scyld.com/vortex.html - -Donald Becker's driver development site: - - http://www.scyld.com/network.html - -Donald's vortex-diag program is useful for inspecting the NIC's state: - - http://www.scyld.com/ethercard_diag.html - -Donald's mii-diag program may be used for inspecting and manipulating -the NIC's Media Independent Interface subsystem: - - http://www.scyld.com/ethercard_diag.html#mii-diag - -Donald's wake-on-LAN page: - - http://www.scyld.com/wakeonlan.html - -3Com's DOS-based application for setting up the NICs EEPROMs: - - ftp://ftp.3com.com/pub/nic/3c90x/3c90xx2.exe - - -Autonegotiation notes ---------------------- - - The driver uses a one-minute heartbeat for adapting to changes in - the external LAN environment if link is up and 5 seconds if link is down. - This means that when, for example, a machine is unplugged from a hubbed - 10baseT LAN plugged into a switched 100baseT LAN, the throughput - will be quite dreadful for up to sixty seconds. Be patient. - - Cisco interoperability note from Walter Wong : - - On a side note, adding HAS_NWAY seems to share a problem with the - Cisco 6509 switch. Specifically, you need to change the spanning - tree parameter for the port the machine is plugged into to 'portfast' - mode. Otherwise, the negotiation fails. This has been an issue - we've noticed for a while but haven't had the time to track down. - - Cisco switches (Jeff Busch ) - - My "standard config" for ports to which PC's/servers connect directly: - - interface FastEthernet0/N - description machinename - load-interval 30 - spanning-tree portfast - - If autonegotiation is a problem, you may need to specify "speed - 100" and "duplex full" as well (or "speed 10" and "duplex half"). - - WARNING: DO NOT hook up hubs/switches/bridges to these - specially-configured ports! The switch will become very confused. - - -Reporting and diagnosing problems ---------------------------------- - -Maintainers find that accurate and complete problem reports are -invaluable in resolving driver problems. We are frequently not able to -reproduce problems and must rely on your patience and efforts to get to -the bottom of the problem. - -If you believe you have a driver problem here are some of the -steps you should take: - -- Is it really a driver problem? - - Eliminate some variables: try different cards, different - computers, different cables, different ports on the switch/hub, - different versions of the kernel or of the driver, etc. - -- OK, it's a driver problem. - - You need to generate a report. Typically this is an email to the - maintainer and/or netdev@vger.kernel.org. The maintainer's - email address will be in the driver source or in the MAINTAINERS file. - -- The contents of your report will vary a lot depending upon the - problem. If it's a kernel crash then you should refer to the - admin-guide/reporting-bugs.rst file. - - But for most problems it is useful to provide the following: - - o Kernel version, driver version - - o A copy of the banner message which the driver generates when - it is initialised. For example: - - eth0: 3Com PCI 3c905C Tornado at 0xa400, 00:50:da:6a:88:f0, IRQ 19 - 8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface. - MII transceiver found at address 24, status 782d. - Enabling bus-master transmits and whole-frame receives. - - NOTE: You must provide the `debug=2' modprobe option to generate - a full detection message. Please do this: - - modprobe 3c59x debug=2 - - o If it is a PCI device, the relevant output from 'lspci -vx', eg: - - 00:09.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74) - Subsystem: 3Com Corporation: Unknown device 9200 - Flags: bus master, medium devsel, latency 32, IRQ 19 - I/O ports at a400 [size=128] - Memory at db000000 (32-bit, non-prefetchable) [size=128] - Expansion ROM at [disabled] [size=128K] - Capabilities: [dc] Power Management version 2 - 00: b7 10 00 92 07 00 10 02 74 00 00 02 08 20 00 00 - 10: 01 a4 00 00 00 00 00 db 00 00 00 00 00 00 00 00 - 20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10 - 30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a - - o A description of the environment: 10baseT? 100baseT? - full/half duplex? switched or hubbed? - - o Any additional module parameters which you may be providing to the driver. - - o Any kernel logs which are produced. The more the merrier. - If this is a large file and you are sending your report to a - mailing list, mention that you have the logfile, but don't send - it. If you're reporting direct to the maintainer then just send - it. - - To ensure that all kernel logs are available, add the - following line to /etc/syslog.conf: - - kern.* /var/log/messages - - Then restart syslogd with: - - /etc/rc.d/init.d/syslog restart - - (The above may vary, depending upon which Linux distribution you use). - - o If your problem is reproducible then that's great. Try the - following: - - 1) Increase the debug level. Usually this is done via: - - a) modprobe driver debug=7 - b) In /etc/modprobe.d/driver.conf: - options driver debug=7 - - 2) Recreate the problem with the higher debug level, - send all logs to the maintainer. - - 3) Download you card's diagnostic tool from Donald - Becker's website . - Download mii-diag.c as well. Build these. - - a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is - working correctly. Save the output. - - b) Run the above commands when the card is malfunctioning. Send - both sets of output. - -Finally, please be patient and be prepared to do some work. You may -end up working on this problem for a week or more as the maintainer -asks more questions, asks for more tests, asks for patches to be -applied, etc. At the end of it all, the problem may even remain -unresolved. diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 402a9188f446..aaac502b81ea 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -28,6 +28,7 @@ Contents: pensando/ionic stmicro/stmmac 3com/3c509 + 3com/vortex .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index bee65ebdc67e..eaea5f1994c9 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -147,7 +147,7 @@ Maintainers List M: Steffen Klassert L: netdev@vger.kernel.org S: Odd Fixes -F: Documentation/networking/device_drivers/3com/vortex.txt +F: Documentation/networking/device_drivers/3com/vortex.rst F: drivers/net/ethernet/3com/3c59x.c 3CR990 NETWORK DRIVER diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c index a2b7f7ab8170..5984b7033999 100644 --- a/drivers/net/ethernet/3com/3c59x.c +++ b/drivers/net/ethernet/3com/3c59x.c @@ -1149,7 +1149,7 @@ static int vortex_probe1(struct device *gendev, void __iomem *ioaddr, int irq, print_info = (vortex_debug > 1); if (print_info) - pr_info("See Documentation/networking/device_drivers/3com/vortex.txt\n"); + pr_info("See Documentation/networking/device_drivers/3com/vortex.rst\n"); pr_info("%s: 3Com %s %s at %p.\n", print_name, @@ -1954,7 +1954,7 @@ vortex_error(struct net_device *dev, int status) dev->name, tx_status); if (tx_status == 0x82) { pr_err("Probably a duplex mismatch. See " - "Documentation/networking/device_drivers/3com/vortex.txt\n"); + "Documentation/networking/device_drivers/3com/vortex.rst\n"); } dump_tx_ring(dev); } diff --git a/drivers/net/ethernet/3com/Kconfig b/drivers/net/ethernet/3com/Kconfig index 3a6fc99c6f32..7cc259893cb9 100644 --- a/drivers/net/ethernet/3com/Kconfig +++ b/drivers/net/ethernet/3com/Kconfig @@ -76,7 +76,7 @@ config VORTEX "Hurricane" (3c555/3cSOHO) PCI If you have such a card, say Y here. More specific information is in - and + and in the comments at the beginning of . -- cgit v1.2.3 From 8d299c7e912bd8ebb88b9ac2b8e336c9878783aa Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:36 +0200 Subject: docs: networking: device drivers: convert amazon/ena.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/amazon/ena.rst | 344 +++++++++++++++++++++ .../networking/device_drivers/amazon/ena.txt | 308 ------------------ Documentation/networking/device_drivers/index.rst | 1 + MAINTAINERS | 2 +- 4 files changed, 346 insertions(+), 309 deletions(-) create mode 100644 Documentation/networking/device_drivers/amazon/ena.rst delete mode 100644 Documentation/networking/device_drivers/amazon/ena.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/amazon/ena.rst b/Documentation/networking/device_drivers/amazon/ena.rst new file mode 100644 index 000000000000..11af6388ea87 --- /dev/null +++ b/Documentation/networking/device_drivers/amazon/ena.rst @@ -0,0 +1,344 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================ +Linux kernel driver for Elastic Network Adapter (ENA) family +============================================================ + +Overview +======== + +ENA is a networking interface designed to make good use of modern CPU +features and system architectures. + +The ENA device exposes a lightweight management interface with a +minimal set of memory mapped registers and extendable command set +through an Admin Queue. + +The driver supports a range of ENA devices, is link-speed independent +(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has +a negotiated and extendable feature set. + +Some ENA devices support SR-IOV. This driver is used for both the +SR-IOV Physical Function (PF) and Virtual Function (VF) devices. + +ENA devices enable high speed and low overhead network traffic +processing by providing multiple Tx/Rx queue pairs (the maximum number +is advertised by the device via the Admin Queue), a dedicated MSI-X +interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation, +and CPU cacheline optimized data placement. + +The ENA driver supports industry standard TCP/IP offload features such +as checksum offload and TCP transmit segmentation offload (TSO). +Receive-side scaling (RSS) is supported for multi-core scaling. + +The ENA driver and its corresponding devices implement health +monitoring mechanisms such as watchdog, enabling the device and driver +to recover in a manner transparent to the application, as well as +debug logs. + +Some of the ENA devices support a working mode called Low-latency +Queue (LLQ), which saves several more microseconds. + +Supported PCI vendor ID/device IDs +================================== + +========= ======================= +1d0f:0ec2 ENA PF +1d0f:1ec2 ENA PF with LLQ support +1d0f:ec20 ENA VF +1d0f:ec21 ENA VF with LLQ support +========= ======================= + +ENA Source Code Directory Structure +=================================== + +================= ====================================================== +ena_com.[ch] Management communication layer. This layer is + responsible for the handling all the management + (admin) communication between the device and the + driver. +ena_eth_com.[ch] Tx/Rx data path. +ena_admin_defs.h Definition of ENA management interface. +ena_eth_io_defs.h Definition of ENA data path interface. +ena_common_defs.h Common definitions for ena_com layer. +ena_regs_defs.h Definition of ENA PCI memory-mapped (MMIO) registers. +ena_netdev.[ch] Main Linux kernel driver. +ena_syfsfs.[ch] Sysfs files. +ena_ethtool.c ethtool callbacks. +ena_pci_id_tbl.h Supported device IDs. +================= ====================================================== + +Management Interface: +===================== + +ENA management interface is exposed by means of: + +- PCIe Configuration Space +- Device Registers +- Admin Queue (AQ) and Admin Completion Queue (ACQ) +- Asynchronous Event Notification Queue (AENQ) + +ENA device MMIO Registers are accessed only during driver +initialization and are not involved in further normal device +operation. + +AQ is used for submitting management commands, and the +results/responses are reported asynchronously through ACQ. + +ENA introduces a small set of management commands with room for +vendor-specific extensions. Most of the management operations are +framed in a generic Get/Set feature command. + +The following admin queue commands are supported: + +- Create I/O submission queue +- Create I/O completion queue +- Destroy I/O submission queue +- Destroy I/O completion queue +- Get feature +- Set feature +- Configure AENQ +- Get statistics + +Refer to ena_admin_defs.h for the list of supported Get/Set Feature +properties. + +The Asynchronous Event Notification Queue (AENQ) is a uni-directional +queue used by the ENA device to send to the driver events that cannot +be reported using ACQ. AENQ events are subdivided into groups. Each +group may have multiple syndromes, as shown below + +The events are: + + ==================== =============== + Group Syndrome + ==================== =============== + Link state change **X** + Fatal error **X** + Notification Suspend traffic + Notification Resume traffic + Keep-Alive **X** + ==================== =============== + +ACQ and AENQ share the same MSI-X vector. + +Keep-Alive is a special mechanism that allows monitoring of the +device's health. The driver maintains a watchdog (WD) handler which, +if fired, logs the current state and statistics then resets and +restarts the ENA device and driver. A Keep-Alive event is delivered by +the device every second. The driver re-arms the WD upon reception of a +Keep-Alive event. A missed Keep-Alive event causes the WD handler to +fire. + +Data Path Interface +=================== +I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx +SQ correspondingly). Each SQ has a completion queue (CQ) associated +with it. + +The SQs and CQs are implemented as descriptor rings in contiguous +physical memory. + +The ENA driver supports two Queue Operation modes for Tx SQs: + +- Regular mode + + * In this mode the Tx SQs reside in the host's memory. The ENA + device fetches the ENA Tx descriptors and packet data from host + memory. + +- Low Latency Queue (LLQ) mode or "push-mode". + + * In this mode the driver pushes the transmit descriptors and the + first 128 bytes of the packet directly to the ENA device memory + space. The rest of the packet payload is fetched by the + device. For this operation mode, the driver uses a dedicated PCI + device memory BAR, which is mapped with write-combine capability. + +The Rx SQs support only the regular mode. + +Note: Not all ENA devices support LLQ, and this feature is negotiated + with the device upon initialization. If the ENA device does not + support LLQ mode, the driver falls back to the regular mode. + +The driver supports multi-queue for both Tx and Rx. This has various +benefits: + +- Reduced CPU/thread/process contention on a given Ethernet interface. +- Cache miss rate on completion is reduced, particularly for data + cache lines that hold the sk_buff structures. +- Increased process-level parallelism when handling received packets. +- Increased data cache hit rate, by steering kernel processing of + packets to the CPU, where the application thread consuming the + packet is running. +- In hardware interrupt re-direction. + +Interrupt Modes +=============== +The driver assigns a single MSI-X vector per queue pair (for both Tx +and Rx directions). The driver assigns an additional dedicated MSI-X vector +for management (for ACQ and AENQ). + +Management interrupt registration is performed when the Linux kernel +probes the adapter, and it is de-registered when the adapter is +removed. I/O queue interrupt registration is performed when the Linux +interface of the adapter is opened, and it is de-registered when the +interface is closed. + +The management interrupt is named:: + + ena-mgmnt@pci: + +and for each queue pair, an interrupt is named:: + + -Tx-Rx- + +The ENA device operates in auto-mask and auto-clear interrupt +modes. That is, once MSI-X is delivered to the host, its Cause bit is +automatically cleared and the interrupt is masked. The interrupt is +unmasked by the driver after NAPI processing is complete. + +Interrupt Moderation +==================== +ENA driver and device can operate in conventional or adaptive interrupt +moderation mode. + +In conventional mode the driver instructs device to postpone interrupt +posting according to static interrupt delay value. The interrupt delay +value can be configured through ethtool(8). The following ethtool +parameters are supported by the driver: tx-usecs, rx-usecs + +In adaptive interrupt moderation mode the interrupt delay value is +updated by the driver dynamically and adjusted every NAPI cycle +according to the traffic nature. + +By default ENA driver applies adaptive coalescing on Rx traffic and +conventional coalescing on Tx traffic. + +Adaptive coalescing can be switched on/off through ethtool(8) +adaptive_rx on|off parameter. + +The driver chooses interrupt delay value according to the number of +bytes and packets received between interrupt unmasking and interrupt +posting. The driver uses interrupt delay table that subdivides the +range of received bytes/packets into 5 levels and assigns interrupt +delay value to each level. + +The user can enable/disable adaptive moderation, modify the interrupt +delay table and restore its default values through sysfs. + +RX copybreak +============ +The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK +and can be configured by the ETHTOOL_STUNABLE command of the +SIOCETHTOOL ioctl. + +SKB +=== +The driver-allocated SKB for frames received from Rx handling using +NAPI context. The allocation method depends on the size of the packet. +If the frame length is larger than rx_copybreak, napi_get_frags() +is used, otherwise netdev_alloc_skb_ip_align() is used, the buffer +content is copied (by CPU) to the SKB, and the buffer is recycled. + +Statistics +========== +The user can obtain ENA device and driver statistics using ethtool. +The driver can collect regular or extended statistics (including +per-queue stats) from the device. + +In addition the driver logs the stats to syslog upon device reset. + +MTU +=== +The driver supports an arbitrarily large MTU with a maximum that is +negotiated with the device. The driver configures MTU using the +SetFeature command (ENA_ADMIN_MTU property). The user can change MTU +via ip(8) and similar legacy tools. + +Stateless Offloads +================== +The ENA driver supports: + +- TSO over IPv4/IPv6 +- TSO with ECN +- IPv4 header checksum offload +- TCP/UDP over IPv4/IPv6 checksum offloads + +RSS +=== +- The ENA device supports RSS that allows flexible Rx traffic + steering. +- Toeplitz and CRC32 hash functions are supported. +- Different combinations of L2/L3/L4 fields can be configured as + inputs for hash functions. +- The driver configures RSS settings using the AQ SetFeature command + (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and + ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties). +- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash + function delivered in the Rx CQ descriptor is set in the received + SKB. +- The user can provide a hash key, hash function, and configure the + indirection table through ethtool(8). + +DATA PATH +========= +Tx +-- + +end_start_xmit() is called by the stack. This function does the following: + +- Maps data buffers (skb->data and frags). +- Populates ena_buf for the push buffer (if the driver and device are + in push mode.) +- Prepares ENA bufs for the remaining frags. +- Allocates a new request ID from the empty req_id ring. The request + ID is the index of the packet in the Tx info. This is used for + out-of-order TX completions. +- Adds the packet to the proper place in the Tx ring. +- Calls ena_com_prepare_tx(), an ENA communication layer that converts + the ena_bufs to ENA descriptors (and adds meta ENA descriptors as + needed.) + + * This function also copies the ENA descriptors and the push buffer + to the Device memory space (if in push mode.) + +- Writes doorbell to the ENA device. +- When the ENA device finishes sending the packet, a completion + interrupt is raised. +- The interrupt handler schedules NAPI. +- The ena_clean_tx_irq() function is called. This function handles the + completion descriptors generated by the ENA, with a single + completion descriptor per completed packet. + + * req_id is retrieved from the completion descriptor. The tx_info of + the packet is retrieved via the req_id. The data buffers are + unmapped and req_id is returned to the empty req_id ring. + * The function stops when the completion descriptors are completed or + the budget is reached. + +Rx +-- + +- When a packet is received from the ENA device. +- The interrupt handler schedules NAPI. +- The ena_clean_rx_irq() function is called. This function calls + ena_rx_pkt(), an ENA communication layer function, which returns the + number of descriptors used for a new unhandled packet, and zero if + no new packet is found. +- Then it calls the ena_clean_rx_irq() function. +- ena_eth_rx_skb() checks packet length: + + * If the packet is small (len < rx_copybreak), the driver allocates + a SKB for the new packet, and copies the packet payload into the + SKB data buffer. + + - In this way the original data buffer is not passed to the stack + and is reused for future Rx packets. + + * Otherwise the function unmaps the Rx buffer, then allocates the + new SKB structure and hooks the Rx buffer to the SKB frags. + +- The new SKB is updated with the necessary information (protocol, + checksum hw verify result, etc.), and then passed to the network + stack, using the NAPI interface function napi_gro_receive(). diff --git a/Documentation/networking/device_drivers/amazon/ena.txt b/Documentation/networking/device_drivers/amazon/ena.txt deleted file mode 100644 index 1bb55c7b604c..000000000000 --- a/Documentation/networking/device_drivers/amazon/ena.txt +++ /dev/null @@ -1,308 +0,0 @@ -Linux kernel driver for Elastic Network Adapter (ENA) family: -============================================================= - -Overview: -========= -ENA is a networking interface designed to make good use of modern CPU -features and system architectures. - -The ENA device exposes a lightweight management interface with a -minimal set of memory mapped registers and extendable command set -through an Admin Queue. - -The driver supports a range of ENA devices, is link-speed independent -(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has -a negotiated and extendable feature set. - -Some ENA devices support SR-IOV. This driver is used for both the -SR-IOV Physical Function (PF) and Virtual Function (VF) devices. - -ENA devices enable high speed and low overhead network traffic -processing by providing multiple Tx/Rx queue pairs (the maximum number -is advertised by the device via the Admin Queue), a dedicated MSI-X -interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation, -and CPU cacheline optimized data placement. - -The ENA driver supports industry standard TCP/IP offload features such -as checksum offload and TCP transmit segmentation offload (TSO). -Receive-side scaling (RSS) is supported for multi-core scaling. - -The ENA driver and its corresponding devices implement health -monitoring mechanisms such as watchdog, enabling the device and driver -to recover in a manner transparent to the application, as well as -debug logs. - -Some of the ENA devices support a working mode called Low-latency -Queue (LLQ), which saves several more microseconds. - -Supported PCI vendor ID/device IDs: -=================================== -1d0f:0ec2 - ENA PF -1d0f:1ec2 - ENA PF with LLQ support -1d0f:ec20 - ENA VF -1d0f:ec21 - ENA VF with LLQ support - -ENA Source Code Directory Structure: -==================================== -ena_com.[ch] - Management communication layer. This layer is - responsible for the handling all the management - (admin) communication between the device and the - driver. -ena_eth_com.[ch] - Tx/Rx data path. -ena_admin_defs.h - Definition of ENA management interface. -ena_eth_io_defs.h - Definition of ENA data path interface. -ena_common_defs.h - Common definitions for ena_com layer. -ena_regs_defs.h - Definition of ENA PCI memory-mapped (MMIO) registers. -ena_netdev.[ch] - Main Linux kernel driver. -ena_syfsfs.[ch] - Sysfs files. -ena_ethtool.c - ethtool callbacks. -ena_pci_id_tbl.h - Supported device IDs. - -Management Interface: -===================== -ENA management interface is exposed by means of: -- PCIe Configuration Space -- Device Registers -- Admin Queue (AQ) and Admin Completion Queue (ACQ) -- Asynchronous Event Notification Queue (AENQ) - -ENA device MMIO Registers are accessed only during driver -initialization and are not involved in further normal device -operation. - -AQ is used for submitting management commands, and the -results/responses are reported asynchronously through ACQ. - -ENA introduces a small set of management commands with room for -vendor-specific extensions. Most of the management operations are -framed in a generic Get/Set feature command. - -The following admin queue commands are supported: -- Create I/O submission queue -- Create I/O completion queue -- Destroy I/O submission queue -- Destroy I/O completion queue -- Get feature -- Set feature -- Configure AENQ -- Get statistics - -Refer to ena_admin_defs.h for the list of supported Get/Set Feature -properties. - -The Asynchronous Event Notification Queue (AENQ) is a uni-directional -queue used by the ENA device to send to the driver events that cannot -be reported using ACQ. AENQ events are subdivided into groups. Each -group may have multiple syndromes, as shown below - -The events are: - Group Syndrome - Link state change - X - - Fatal error - X - - Notification Suspend traffic - Notification Resume traffic - Keep-Alive - X - - -ACQ and AENQ share the same MSI-X vector. - -Keep-Alive is a special mechanism that allows monitoring of the -device's health. The driver maintains a watchdog (WD) handler which, -if fired, logs the current state and statistics then resets and -restarts the ENA device and driver. A Keep-Alive event is delivered by -the device every second. The driver re-arms the WD upon reception of a -Keep-Alive event. A missed Keep-Alive event causes the WD handler to -fire. - -Data Path Interface: -==================== -I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx -SQ correspondingly). Each SQ has a completion queue (CQ) associated -with it. - -The SQs and CQs are implemented as descriptor rings in contiguous -physical memory. - -The ENA driver supports two Queue Operation modes for Tx SQs: -- Regular mode - * In this mode the Tx SQs reside in the host's memory. The ENA - device fetches the ENA Tx descriptors and packet data from host - memory. -- Low Latency Queue (LLQ) mode or "push-mode". - * In this mode the driver pushes the transmit descriptors and the - first 128 bytes of the packet directly to the ENA device memory - space. The rest of the packet payload is fetched by the - device. For this operation mode, the driver uses a dedicated PCI - device memory BAR, which is mapped with write-combine capability. - -The Rx SQs support only the regular mode. - -Note: Not all ENA devices support LLQ, and this feature is negotiated - with the device upon initialization. If the ENA device does not - support LLQ mode, the driver falls back to the regular mode. - -The driver supports multi-queue for both Tx and Rx. This has various -benefits: -- Reduced CPU/thread/process contention on a given Ethernet interface. -- Cache miss rate on completion is reduced, particularly for data - cache lines that hold the sk_buff structures. -- Increased process-level parallelism when handling received packets. -- Increased data cache hit rate, by steering kernel processing of - packets to the CPU, where the application thread consuming the - packet is running. -- In hardware interrupt re-direction. - -Interrupt Modes: -================ -The driver assigns a single MSI-X vector per queue pair (for both Tx -and Rx directions). The driver assigns an additional dedicated MSI-X vector -for management (for ACQ and AENQ). - -Management interrupt registration is performed when the Linux kernel -probes the adapter, and it is de-registered when the adapter is -removed. I/O queue interrupt registration is performed when the Linux -interface of the adapter is opened, and it is de-registered when the -interface is closed. - -The management interrupt is named: - ena-mgmnt@pci: -and for each queue pair, an interrupt is named: - -Tx-Rx- - -The ENA device operates in auto-mask and auto-clear interrupt -modes. That is, once MSI-X is delivered to the host, its Cause bit is -automatically cleared and the interrupt is masked. The interrupt is -unmasked by the driver after NAPI processing is complete. - -Interrupt Moderation: -===================== -ENA driver and device can operate in conventional or adaptive interrupt -moderation mode. - -In conventional mode the driver instructs device to postpone interrupt -posting according to static interrupt delay value. The interrupt delay -value can be configured through ethtool(8). The following ethtool -parameters are supported by the driver: tx-usecs, rx-usecs - -In adaptive interrupt moderation mode the interrupt delay value is -updated by the driver dynamically and adjusted every NAPI cycle -according to the traffic nature. - -By default ENA driver applies adaptive coalescing on Rx traffic and -conventional coalescing on Tx traffic. - -Adaptive coalescing can be switched on/off through ethtool(8) -adaptive_rx on|off parameter. - -The driver chooses interrupt delay value according to the number of -bytes and packets received between interrupt unmasking and interrupt -posting. The driver uses interrupt delay table that subdivides the -range of received bytes/packets into 5 levels and assigns interrupt -delay value to each level. - -The user can enable/disable adaptive moderation, modify the interrupt -delay table and restore its default values through sysfs. - -RX copybreak: -============= -The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK -and can be configured by the ETHTOOL_STUNABLE command of the -SIOCETHTOOL ioctl. - -SKB: -==== -The driver-allocated SKB for frames received from Rx handling using -NAPI context. The allocation method depends on the size of the packet. -If the frame length is larger than rx_copybreak, napi_get_frags() -is used, otherwise netdev_alloc_skb_ip_align() is used, the buffer -content is copied (by CPU) to the SKB, and the buffer is recycled. - -Statistics: -=========== -The user can obtain ENA device and driver statistics using ethtool. -The driver can collect regular or extended statistics (including -per-queue stats) from the device. - -In addition the driver logs the stats to syslog upon device reset. - -MTU: -==== -The driver supports an arbitrarily large MTU with a maximum that is -negotiated with the device. The driver configures MTU using the -SetFeature command (ENA_ADMIN_MTU property). The user can change MTU -via ip(8) and similar legacy tools. - -Stateless Offloads: -=================== -The ENA driver supports: -- TSO over IPv4/IPv6 -- TSO with ECN -- IPv4 header checksum offload -- TCP/UDP over IPv4/IPv6 checksum offloads - -RSS: -==== -- The ENA device supports RSS that allows flexible Rx traffic - steering. -- Toeplitz and CRC32 hash functions are supported. -- Different combinations of L2/L3/L4 fields can be configured as - inputs for hash functions. -- The driver configures RSS settings using the AQ SetFeature command - (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and - ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties). -- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash - function delivered in the Rx CQ descriptor is set in the received - SKB. -- The user can provide a hash key, hash function, and configure the - indirection table through ethtool(8). - -DATA PATH: -========== -Tx: ---- -end_start_xmit() is called by the stack. This function does the following: -- Maps data buffers (skb->data and frags). -- Populates ena_buf for the push buffer (if the driver and device are - in push mode.) -- Prepares ENA bufs for the remaining frags. -- Allocates a new request ID from the empty req_id ring. The request - ID is the index of the packet in the Tx info. This is used for - out-of-order TX completions. -- Adds the packet to the proper place in the Tx ring. -- Calls ena_com_prepare_tx(), an ENA communication layer that converts - the ena_bufs to ENA descriptors (and adds meta ENA descriptors as - needed.) - * This function also copies the ENA descriptors and the push buffer - to the Device memory space (if in push mode.) -- Writes doorbell to the ENA device. -- When the ENA device finishes sending the packet, a completion - interrupt is raised. -- The interrupt handler schedules NAPI. -- The ena_clean_tx_irq() function is called. This function handles the - completion descriptors generated by the ENA, with a single - completion descriptor per completed packet. - * req_id is retrieved from the completion descriptor. The tx_info of - the packet is retrieved via the req_id. The data buffers are - unmapped and req_id is returned to the empty req_id ring. - * The function stops when the completion descriptors are completed or - the budget is reached. - -Rx: ---- -- When a packet is received from the ENA device. -- The interrupt handler schedules NAPI. -- The ena_clean_rx_irq() function is called. This function calls - ena_rx_pkt(), an ENA communication layer function, which returns the - number of descriptors used for a new unhandled packet, and zero if - no new packet is found. -- Then it calls the ena_clean_rx_irq() function. -- ena_eth_rx_skb() checks packet length: - * If the packet is small (len < rx_copybreak), the driver allocates - a SKB for the new packet, and copies the packet payload into the - SKB data buffer. - - In this way the original data buffer is not passed to the stack - and is reused for future Rx packets. - * Otherwise the function unmaps the Rx buffer, then allocates the - new SKB structure and hooks the Rx buffer to the SKB frags. -- The new SKB is updated with the necessary information (protocol, - checksum hw verify result, etc.), and then passed to the network - stack, using the NAPI interface function napi_gro_receive(). diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index aaac502b81ea..019a0d2efe67 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -29,6 +29,7 @@ Contents: stmicro/stmmac 3com/3c509 3com/vortex + amazon/ena .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index eaea5f1994c9..7b6c13cc832f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -815,7 +815,7 @@ R: Saeed Bishara R: Zorik Machulsky L: netdev@vger.kernel.org S: Supported -F: Documentation/networking/device_drivers/amazon/ena.txt +F: Documentation/networking/device_drivers/amazon/ena.rst F: drivers/net/ethernet/amazon/ AMAZON RDMA EFA DRIVER -- cgit v1.2.3 From c958119a487ec4578f50b352f45e965a30daa020 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:37 +0200 Subject: docs: networking: device drivers: convert aquantia/atlantic.txt to ReST - add SPDX header; - use copyright symbol; - adjust title and its markup; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../device_drivers/aquantia/atlantic.rst | 556 +++++++++++++++++++++ .../device_drivers/aquantia/atlantic.txt | 479 ------------------ Documentation/networking/device_drivers/index.rst | 1 + MAINTAINERS | 2 +- 4 files changed, 558 insertions(+), 480 deletions(-) create mode 100644 Documentation/networking/device_drivers/aquantia/atlantic.rst delete mode 100644 Documentation/networking/device_drivers/aquantia/atlantic.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.rst b/Documentation/networking/device_drivers/aquantia/atlantic.rst new file mode 100644 index 000000000000..595ddef1c8b3 --- /dev/null +++ b/Documentation/networking/device_drivers/aquantia/atlantic.rst @@ -0,0 +1,556 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +=============================== +Marvell(Aquantia) AQtion Driver +=============================== + +For the aQuantia Multi-Gigabit PCI Express Family of Ethernet Adapters + +.. Contents + + - Identifying Your Adapter + - Configuration + - Supported ethtool options + - Command Line Parameters + - Config file parameters + - Support + - License + +Identifying Your Adapter +======================== + +The driver in this release is compatible with AQC-100, AQC-107, AQC-108 +based ethernet adapters. + + +SFP+ Devices (for AQC-100 based adapters) +----------------------------------------- + +This release tested with passive Direct Attach Cables (DAC) and SFP+/LC +Optical Transceiver. + +Configuration +============= + +Viewing Link Messages +--------------------- + Link messages will not be displayed to the console if the distribution is + restricting system messages. In order to see network driver link messages on + your console, set dmesg to eight by entering the following:: + + dmesg -n 8 + + .. note:: + + This setting is not saved across reboots. + +Jumbo Frames +------------ + The driver supports Jumbo Frames for all adapters. Jumbo Frames support is + enabled by changing the MTU to a value larger than the default of 1500. + The maximum value for the MTU is 16000. Use the `ip` command to + increase the MTU size. For example:: + + ip link set mtu 16000 dev enp1s0 + +ethtool +------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + ethtool version is required for this functionality. + +NAPI +---- + NAPI (Rx polling mode) is supported in the atlantic driver. + +Supported ethtool options +========================= + +Viewing adapter settings +------------------------ + + :: + + ethtool + + Output example:: + + Settings for enp1s0: + Supported ports: [ TP ] + Supported link modes: 100baseT/Full + 1000baseT/Full + 10000baseT/Full + 2500baseT/Full + 5000baseT/Full + Supported pause frame use: Symmetric + Supports auto-negotiation: Yes + Supported FEC modes: Not reported + Advertised link modes: 100baseT/Full + 1000baseT/Full + 10000baseT/Full + 2500baseT/Full + 5000baseT/Full + Advertised pause frame use: Symmetric + Advertised auto-negotiation: Yes + Advertised FEC modes: Not reported + Speed: 10000Mb/s + Duplex: Full + Port: Twisted Pair + PHYAD: 0 + Transceiver: internal + Auto-negotiation: on + MDI-X: Unknown + Supports Wake-on: g + Wake-on: d + Link detected: yes + + + .. note:: + + AQrate speeds (2.5/5 Gb/s) will be displayed only with linux kernels > 4.10. + But you can still use these speeds:: + + ethtool -s eth0 autoneg off speed 2500 + +Viewing adapter information +--------------------------- + + :: + + ethtool -i + + Output example:: + + driver: atlantic + version: 5.2.0-050200rc5-generic-kern + firmware-version: 3.1.78 + expansion-rom-version: + bus-info: 0000:01:00.0 + supports-statistics: yes + supports-test: no + supports-eeprom-access: no + supports-register-dump: yes + supports-priv-flags: no + + +Viewing Ethernet adapter statistics +----------------------------------- + + :: + + ethtool -S + + Output example:: + + NIC statistics: + InPackets: 13238607 + InUCast: 13293852 + InMCast: 52 + InBCast: 3 + InErrors: 0 + OutPackets: 23703019 + OutUCast: 23704941 + OutMCast: 67 + OutBCast: 11 + InUCastOctects: 213182760 + OutUCastOctects: 22698443 + InMCastOctects: 6600 + OutMCastOctects: 8776 + InBCastOctects: 192 + OutBCastOctects: 704 + InOctects: 2131839552 + OutOctects: 226938073 + InPacketsDma: 95532300 + OutPacketsDma: 59503397 + InOctetsDma: 1137102462 + OutOctetsDma: 2394339518 + InDroppedDma: 0 + Queue[0] InPackets: 23567131 + Queue[0] OutPackets: 20070028 + Queue[0] InJumboPackets: 0 + Queue[0] InLroPackets: 0 + Queue[0] InErrors: 0 + Queue[1] InPackets: 45428967 + Queue[1] OutPackets: 11306178 + Queue[1] InJumboPackets: 0 + Queue[1] InLroPackets: 0 + Queue[1] InErrors: 0 + Queue[2] InPackets: 3187011 + Queue[2] OutPackets: 13080381 + Queue[2] InJumboPackets: 0 + Queue[2] InLroPackets: 0 + Queue[2] InErrors: 0 + Queue[3] InPackets: 23349136 + Queue[3] OutPackets: 15046810 + Queue[3] InJumboPackets: 0 + Queue[3] InLroPackets: 0 + Queue[3] InErrors: 0 + +Interrupt coalescing support +---------------------------- + + ITR mode, TX/RX coalescing timings could be viewed with:: + + ethtool -c + + and changed with:: + + ethtool -C tx-usecs rx-usecs + + To disable coalescing:: + + ethtool -C tx-usecs 0 rx-usecs 0 tx-max-frames 1 tx-max-frames 1 + +Wake on LAN support +------------------- + + WOL support by magic packet:: + + ethtool -s wol g + + To disable WOL:: + + ethtool -s wol d + +Set and check the driver message level +-------------------------------------- + + Set message level + + :: + + ethtool -s msglvl + + Level values: + + ====== ============================= + 0x0001 general driver status. + 0x0002 hardware probing. + 0x0004 link state. + 0x0008 periodic status check. + 0x0010 interface being brought down. + 0x0020 interface being brought up. + 0x0040 receive error. + 0x0080 transmit error. + 0x0200 interrupt handling. + 0x0400 transmit completion. + 0x0800 receive completion. + 0x1000 packet contents. + 0x2000 hardware status. + 0x4000 Wake-on-LAN status. + ====== ============================= + + By default, the level of debugging messages is set 0x0001(general driver status). + + Check message level + + :: + + ethtool | grep "Current message level" + + If you want to disable the output of messages:: + + ethtool -s msglvl 0 + +RX flow rules (ntuple filters) +------------------------------ + + There are separate rules supported, that applies in that order: + + 1. 16 VLAN ID rules + 2. 16 L2 EtherType rules + 3. 8 L3/L4 5-Tuple rules + + + The driver utilizes the ethtool interface for configuring ntuple filters, + via ``ethtool -N ``. + + To enable or disable the RX flow rules:: + + ethtool -K ethX ntuple + + When disabling ntuple filters, all the user programed filters are + flushed from the driver cache and hardware. All needed filters must + be re-added when ntuple is re-enabled. + + Because of the fixed order of the rules, the location of filters is also fixed: + + - Locations 0 - 15 for VLAN ID filters + - Locations 16 - 31 for L2 EtherType filters + - Locations 32 - 39 for L3/L4 5-tuple filters (locations 32, 36 for IPv6) + + The L3/L4 5-tuple (protocol, source and destination IP address, source and + destination TCP/UDP/SCTP port) is compared against 8 filters. For IPv4, up to + 8 source and destination addresses can be matched. For IPv6, up to 2 pairs of + addresses can be supported. Source and destination ports are only compared for + TCP/UDP/SCTP packets. + + To add a filter that directs packet to queue 5, use + ``<-N|-U|--config-nfc|--config-ntuple>`` switch:: + + ethtool -N flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 2000 dst-port 2001 action 5 + + - action is the queue number. + - loc is the rule number. + + For ``flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6`` you must set the loc + number within 32 - 39. + For ``flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6`` you can set 8 rules + for traffic IPv4 or you can set 2 rules for traffic IPv6. Loc number traffic + IPv6 is 32 and 36. + At the moment you can not use IPv4 and IPv6 filters at the same time. + + Example filter for IPv6 filter traffic:: + + sudo ethtool -N flow-type tcp6 src-ip 2001:db8:0:f101::1 dst-ip 2001:db8:0:f101::2 action 1 loc 32 + sudo ethtool -N flow-type ip6 src-ip 2001:db8:0:f101::2 dst-ip 2001:db8:0:f101::5 action -1 loc 36 + + Example filter for IPv4 filter traffic:: + + sudo ethtool -N flow-type udp4 src-ip 10.0.0.4 dst-ip 10.0.0.7 src-port 2000 dst-port 2001 loc 32 + sudo ethtool -N flow-type tcp4 src-ip 10.0.0.3 dst-ip 10.0.0.9 src-port 2000 dst-port 2001 loc 33 + sudo ethtool -N flow-type ip4 src-ip 10.0.0.6 dst-ip 10.0.0.4 loc 34 + + If you set action -1, then all traffic corresponding to the filter will be discarded. + + The maximum value action is 31. + + + The VLAN filter (VLAN id) is compared against 16 filters. + VLAN id must be accompanied by mask 0xF000. That is to distinguish VLAN filter + from L2 Ethertype filter with UserPriority since both User Priority and VLAN ID + are passed in the same 'vlan' parameter. + + To add a filter that directs packets from VLAN 2001 to queue 5:: + + ethtool -N flow-type ip4 vlan 2001 m 0xF000 action 1 loc 0 + + + L2 EtherType filters allows filter packet by EtherType field or both EtherType + and User Priority (PCP) field of 802.1Q. + UserPriority (vlan) parameter must be accompanied by mask 0x1FFF. That is to + distinguish VLAN filter from L2 Ethertype filter with UserPriority since both + User Priority and VLAN ID are passed in the same 'vlan' parameter. + + To add a filter that directs IP4 packess of priority 3 to queue 3:: + + ethtool -N flow-type ether proto 0x800 vlan 0x600 m 0x1FFF action 3 loc 16 + + To see the list of filters currently present:: + + ethtool <-u|-n|--show-nfc|--show-ntuple> + + Rules may be deleted from the table itself. This is done using:: + + sudo ethtool <-N|-U|--config-nfc|--config-ntuple> delete + + - loc is the rule number to be deleted. + + Rx filters is an interface to load the filter table that funnels all flow + into queue 0 unless an alternative queue is specified using "action". In that + case, any flow that matches the filter criteria will be directed to the + appropriate queue. RX filters is supported on all kernels 2.6.30 and later. + +RSS for UDP +----------- + + Currently, NIC does not support RSS for fragmented IP packets, which leads to + incorrect working of RSS for fragmented UDP traffic. To disable RSS for UDP the + RX Flow L3/L4 rule may be used. + + Example:: + + ethtool -N eth0 flow-type udp4 action 0 loc 32 + +UDP GSO hardware offload +------------------------ + + UDP GSO allows to boost UDP tx rates by offloading UDP headers allocation + into hardware. A special userspace socket option is required for this, + could be validated with /kernel/tools/testing/selftests/net/:: + + udpgso_bench_tx -u -4 -D 10.0.1.1 -s 6300 -S 100 + + Will cause sending out of 100 byte sized UDP packets formed from single + 6300 bytes user buffer. + + UDP GSO is configured by:: + + ethtool -K eth0 tx-udp-segmentation on + +Private flags (testing) +----------------------- + + Atlantic driver supports private flags for hardware custom features:: + + $ ethtool --show-priv-flags ethX + + Private flags for ethX: + DMASystemLoopback : off + PKTSystemLoopback : off + DMANetworkLoopback : off + PHYInternalLoopback: off + PHYExternalLoopback: off + + Example:: + + $ ethtool --set-priv-flags ethX DMASystemLoopback on + + DMASystemLoopback: DMA Host loopback. + PKTSystemLoopback: Packet buffer host loopback. + DMANetworkLoopback: Network side loopback on DMA block. + PHYInternalLoopback: Internal loopback on Phy. + PHYExternalLoopback: External loopback on Phy (with loopback ethernet cable). + + +Command Line Parameters +======================= +The following command line parameters are available on atlantic driver: + +aq_itr -Interrupt throttling mode +--------------------------------- +Accepted values: 0, 1, 0xFFFF + +Default value: 0xFFFF + +====== ============================================================== +0 Disable interrupt throttling. +1 Enable interrupt throttling and use specified tx and rx rates. +0xFFFF Auto throttling mode. Driver will choose the best RX and TX + interrupt throtting settings based on link speed. +====== ============================================================== + +aq_itr_tx - TX interrupt throttle rate +-------------------------------------- + +Accepted values: 0 - 0x1FF + +Default value: 0 + +TX side throttling in microseconds. Adapter will setup maximum interrupt delay +to this value. Minimum interrupt delay will be a half of this value + +aq_itr_rx - RX interrupt throttle rate +-------------------------------------- + +Accepted values: 0 - 0x1FF + +Default value: 0 + +RX side throttling in microseconds. Adapter will setup maximum interrupt delay +to this value. Minimum interrupt delay will be a half of this value + +.. note:: + + ITR settings could be changed in runtime by ethtool -c means (see below) + +Config file parameters +====================== + +For some fine tuning and performance optimizations, +some parameters can be changed in the {source_dir}/aq_cfg.h file. + +AQ_CFG_RX_PAGEORDER +------------------- + +Default value: 0 + +RX page order override. Thats a power of 2 number of RX pages allocated for +each descriptor. Received descriptor size is still limited by +AQ_CFG_RX_FRAME_MAX. + +Increasing pageorder makes page reuse better (actual on iommu enabled systems). + +AQ_CFG_RX_REFILL_THRES +---------------------- + +Default value: 32 + +RX refill threshold. RX path will not refill freed descriptors until the +specified number of free descriptors is observed. Larger values may help +better page reuse but may lead to packet drops as well. + +AQ_CFG_VECS_DEF +--------------- + +Number of queues + +Valid Range: 0 - 8 (up to AQ_CFG_VECS_MAX) + +Default value: 8 + +Notice this value will be capped by the number of cores available on the system. + +AQ_CFG_IS_RSS_DEF +----------------- + +Enable/disable Receive Side Scaling + +This feature allows the adapter to distribute receive processing +across multiple CPU-cores and to prevent from overloading a single CPU core. + +Valid values + +== ======== +0 disabled +1 enabled +== ======== + +Default value: 1 + +AQ_CFG_NUM_RSS_QUEUES_DEF +------------------------- + +Number of queues for Receive Side Scaling + +Valid Range: 0 - 8 (up to AQ_CFG_VECS_DEF) + +Default value: AQ_CFG_VECS_DEF + +AQ_CFG_IS_LRO_DEF +----------------- + +Enable/disable Large Receive Offload + +This offload enables the adapter to coalesce multiple TCP segments and indicate +them as a single coalesced unit to the OS networking subsystem. + +The system consumes less energy but it also introduces more latency in packets +processing. + +Valid values + +== ======== +0 disabled +1 enabled +== ======== + +Default value: 1 + +AQ_CFG_TX_CLEAN_BUDGET +---------------------- + +Maximum descriptors to cleanup on TX at once. + +Default value: 256 + +After the aq_cfg.h file changed the driver must be rebuilt to take effect. + +Support +======= + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to aqn_support@marvell.com + +License +======= + +aQuantia Corporation Network Driver + +Copyright |copy| 2014 - 2019 aQuantia Corporation. + +This program is free software; you can redistribute it and/or modify it +under the terms and conditions of the GNU General Public License, +version 2, as published by the Free Software Foundation. diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.txt b/Documentation/networking/device_drivers/aquantia/atlantic.txt deleted file mode 100644 index 2013fcedc2da..000000000000 --- a/Documentation/networking/device_drivers/aquantia/atlantic.txt +++ /dev/null @@ -1,479 +0,0 @@ -Marvell(Aquantia) AQtion Driver for the aQuantia Multi-Gigabit PCI Express -Family of Ethernet Adapters -============================================================================= - -Contents -======== - -- Identifying Your Adapter -- Configuration -- Supported ethtool options -- Command Line Parameters -- Config file parameters -- Support -- License - -Identifying Your Adapter -======================== - -The driver in this release is compatible with AQC-100, AQC-107, AQC-108 based ethernet adapters. - - -SFP+ Devices (for AQC-100 based adapters) ----------------------------------- - -This release tested with passive Direct Attach Cables (DAC) and SFP+/LC Optical Transceiver. - -Configuration -========================= - Viewing Link Messages - --------------------- - Link messages will not be displayed to the console if the distribution is - restricting system messages. In order to see network driver link messages on - your console, set dmesg to eight by entering the following: - - dmesg -n 8 - - NOTE: This setting is not saved across reboots. - - Jumbo Frames - ------------ - The driver supports Jumbo Frames for all adapters. Jumbo Frames support is - enabled by changing the MTU to a value larger than the default of 1500. - The maximum value for the MTU is 16000. Use the `ip` command to - increase the MTU size. For example: - - ip link set mtu 16000 dev enp1s0 - - ethtool - ------- - The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. The latest - ethtool version is required for this functionality. - - NAPI - ---- - NAPI (Rx polling mode) is supported in the atlantic driver. - -Supported ethtool options -============================ - Viewing adapter settings - --------------------- - ethtool - - Output example: - - Settings for enp1s0: - Supported ports: [ TP ] - Supported link modes: 100baseT/Full - 1000baseT/Full - 10000baseT/Full - 2500baseT/Full - 5000baseT/Full - Supported pause frame use: Symmetric - Supports auto-negotiation: Yes - Supported FEC modes: Not reported - Advertised link modes: 100baseT/Full - 1000baseT/Full - 10000baseT/Full - 2500baseT/Full - 5000baseT/Full - Advertised pause frame use: Symmetric - Advertised auto-negotiation: Yes - Advertised FEC modes: Not reported - Speed: 10000Mb/s - Duplex: Full - Port: Twisted Pair - PHYAD: 0 - Transceiver: internal - Auto-negotiation: on - MDI-X: Unknown - Supports Wake-on: g - Wake-on: d - Link detected: yes - - --- - Note: AQrate speeds (2.5/5 Gb/s) will be displayed only with linux kernels > 4.10. - But you can still use these speeds: - ethtool -s eth0 autoneg off speed 2500 - - Viewing adapter information - --------------------- - ethtool -i - - Output example: - - driver: atlantic - version: 5.2.0-050200rc5-generic-kern - firmware-version: 3.1.78 - expansion-rom-version: - bus-info: 0000:01:00.0 - supports-statistics: yes - supports-test: no - supports-eeprom-access: no - supports-register-dump: yes - supports-priv-flags: no - - - Viewing Ethernet adapter statistics: - --------------------- - ethtool -S - - Output example: - NIC statistics: - InPackets: 13238607 - InUCast: 13293852 - InMCast: 52 - InBCast: 3 - InErrors: 0 - OutPackets: 23703019 - OutUCast: 23704941 - OutMCast: 67 - OutBCast: 11 - InUCastOctects: 213182760 - OutUCastOctects: 22698443 - InMCastOctects: 6600 - OutMCastOctects: 8776 - InBCastOctects: 192 - OutBCastOctects: 704 - InOctects: 2131839552 - OutOctects: 226938073 - InPacketsDma: 95532300 - OutPacketsDma: 59503397 - InOctetsDma: 1137102462 - OutOctetsDma: 2394339518 - InDroppedDma: 0 - Queue[0] InPackets: 23567131 - Queue[0] OutPackets: 20070028 - Queue[0] InJumboPackets: 0 - Queue[0] InLroPackets: 0 - Queue[0] InErrors: 0 - Queue[1] InPackets: 45428967 - Queue[1] OutPackets: 11306178 - Queue[1] InJumboPackets: 0 - Queue[1] InLroPackets: 0 - Queue[1] InErrors: 0 - Queue[2] InPackets: 3187011 - Queue[2] OutPackets: 13080381 - Queue[2] InJumboPackets: 0 - Queue[2] InLroPackets: 0 - Queue[2] InErrors: 0 - Queue[3] InPackets: 23349136 - Queue[3] OutPackets: 15046810 - Queue[3] InJumboPackets: 0 - Queue[3] InLroPackets: 0 - Queue[3] InErrors: 0 - - Interrupt coalescing support - --------------------------------- - ITR mode, TX/RX coalescing timings could be viewed with: - - ethtool -c - - and changed with: - - ethtool -C tx-usecs rx-usecs - - To disable coalescing: - - ethtool -C tx-usecs 0 rx-usecs 0 tx-max-frames 1 tx-max-frames 1 - - Wake on LAN support - --------------------------------- - - WOL support by magic packet: - - ethtool -s wol g - - To disable WOL: - - ethtool -s wol d - - Set and check the driver message level - --------------------------------- - - Set message level - - ethtool -s msglvl - - Level values: - - 0x0001 - general driver status. - 0x0002 - hardware probing. - 0x0004 - link state. - 0x0008 - periodic status check. - 0x0010 - interface being brought down. - 0x0020 - interface being brought up. - 0x0040 - receive error. - 0x0080 - transmit error. - 0x0200 - interrupt handling. - 0x0400 - transmit completion. - 0x0800 - receive completion. - 0x1000 - packet contents. - 0x2000 - hardware status. - 0x4000 - Wake-on-LAN status. - - By default, the level of debugging messages is set 0x0001(general driver status). - - Check message level - - ethtool | grep "Current message level" - - If you want to disable the output of messages - - ethtool -s msglvl 0 - - RX flow rules (ntuple filters) - --------------------------------- - There are separate rules supported, that applies in that order: - 1. 16 VLAN ID rules - 2. 16 L2 EtherType rules - 3. 8 L3/L4 5-Tuple rules - - - The driver utilizes the ethtool interface for configuring ntuple filters, - via "ethtool -N ". - - To enable or disable the RX flow rules: - - ethtool -K ethX ntuple - - When disabling ntuple filters, all the user programed filters are - flushed from the driver cache and hardware. All needed filters must - be re-added when ntuple is re-enabled. - - Because of the fixed order of the rules, the location of filters is also fixed: - - Locations 0 - 15 for VLAN ID filters - - Locations 16 - 31 for L2 EtherType filters - - Locations 32 - 39 for L3/L4 5-tuple filters (locations 32, 36 for IPv6) - - The L3/L4 5-tuple (protocol, source and destination IP address, source and - destination TCP/UDP/SCTP port) is compared against 8 filters. For IPv4, up to - 8 source and destination addresses can be matched. For IPv6, up to 2 pairs of - addresses can be supported. Source and destination ports are only compared for - TCP/UDP/SCTP packets. - - To add a filter that directs packet to queue 5, use <-N|-U|--config-nfc|--config-ntuple> switch: - - ethtool -N flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 2000 dst-port 2001 action 5 - - - action is the queue number. - - loc is the rule number. - - For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you must set the loc - number within 32 - 39. - For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you can set 8 rules - for traffic IPv4 or you can set 2 rules for traffic IPv6. Loc number traffic - IPv6 is 32 and 36. - At the moment you can not use IPv4 and IPv6 filters at the same time. - - Example filter for IPv6 filter traffic: - - sudo ethtool -N flow-type tcp6 src-ip 2001:db8:0:f101::1 dst-ip 2001:db8:0:f101::2 action 1 loc 32 - sudo ethtool -N flow-type ip6 src-ip 2001:db8:0:f101::2 dst-ip 2001:db8:0:f101::5 action -1 loc 36 - - Example filter for IPv4 filter traffic: - - sudo ethtool -N flow-type udp4 src-ip 10.0.0.4 dst-ip 10.0.0.7 src-port 2000 dst-port 2001 loc 32 - sudo ethtool -N flow-type tcp4 src-ip 10.0.0.3 dst-ip 10.0.0.9 src-port 2000 dst-port 2001 loc 33 - sudo ethtool -N flow-type ip4 src-ip 10.0.0.6 dst-ip 10.0.0.4 loc 34 - - If you set action -1, then all traffic corresponding to the filter will be discarded. - The maximum value action is 31. - - - The VLAN filter (VLAN id) is compared against 16 filters. - VLAN id must be accompanied by mask 0xF000. That is to distinguish VLAN filter - from L2 Ethertype filter with UserPriority since both User Priority and VLAN ID - are passed in the same 'vlan' parameter. - - To add a filter that directs packets from VLAN 2001 to queue 5: - ethtool -N flow-type ip4 vlan 2001 m 0xF000 action 1 loc 0 - - - L2 EtherType filters allows filter packet by EtherType field or both EtherType - and User Priority (PCP) field of 802.1Q. - UserPriority (vlan) parameter must be accompanied by mask 0x1FFF. That is to - distinguish VLAN filter from L2 Ethertype filter with UserPriority since both - User Priority and VLAN ID are passed in the same 'vlan' parameter. - - To add a filter that directs IP4 packess of priority 3 to queue 3: - ethtool -N flow-type ether proto 0x800 vlan 0x600 m 0x1FFF action 3 loc 16 - - - To see the list of filters currently present: - - ethtool <-u|-n|--show-nfc|--show-ntuple> - - Rules may be deleted from the table itself. This is done using: - - sudo ethtool <-N|-U|--config-nfc|--config-ntuple> delete - - - loc is the rule number to be deleted. - - Rx filters is an interface to load the filter table that funnels all flow - into queue 0 unless an alternative queue is specified using "action". In that - case, any flow that matches the filter criteria will be directed to the - appropriate queue. RX filters is supported on all kernels 2.6.30 and later. - - RSS for UDP - --------------------------------- - Currently, NIC does not support RSS for fragmented IP packets, which leads to - incorrect working of RSS for fragmented UDP traffic. To disable RSS for UDP the - RX Flow L3/L4 rule may be used. - - Example: - ethtool -N eth0 flow-type udp4 action 0 loc 32 - - UDP GSO hardware offload - --------------------------------- - UDP GSO allows to boost UDP tx rates by offloading UDP headers allocation - into hardware. A special userspace socket option is required for this, - could be validated with /kernel/tools/testing/selftests/net/ - - udpgso_bench_tx -u -4 -D 10.0.1.1 -s 6300 -S 100 - - Will cause sending out of 100 byte sized UDP packets formed from single - 6300 bytes user buffer. - - UDP GSO is configured by: - - ethtool -K eth0 tx-udp-segmentation on - - Private flags (testing) - --------------------------------- - - Atlantic driver supports private flags for hardware custom features: - - $ ethtool --show-priv-flags ethX - - Private flags for ethX: - DMASystemLoopback : off - PKTSystemLoopback : off - DMANetworkLoopback : off - PHYInternalLoopback: off - PHYExternalLoopback: off - - Example: - - $ ethtool --set-priv-flags ethX DMASystemLoopback on - - DMASystemLoopback: DMA Host loopback. - PKTSystemLoopback: Packet buffer host loopback. - DMANetworkLoopback: Network side loopback on DMA block. - PHYInternalLoopback: Internal loopback on Phy. - PHYExternalLoopback: External loopback on Phy (with loopback ethernet cable). - - -Command Line Parameters -======================= -The following command line parameters are available on atlantic driver: - -aq_itr -Interrupt throttling mode ----------------------------------------- -Accepted values: 0, 1, 0xFFFF -Default value: 0xFFFF -0 - Disable interrupt throttling. -1 - Enable interrupt throttling and use specified tx and rx rates. -0xFFFF - Auto throttling mode. Driver will choose the best RX and TX - interrupt throtting settings based on link speed. - -aq_itr_tx - TX interrupt throttle rate ----------------------------------------- -Accepted values: 0 - 0x1FF -Default value: 0 -TX side throttling in microseconds. Adapter will setup maximum interrupt delay -to this value. Minimum interrupt delay will be a half of this value - -aq_itr_rx - RX interrupt throttle rate ----------------------------------------- -Accepted values: 0 - 0x1FF -Default value: 0 -RX side throttling in microseconds. Adapter will setup maximum interrupt delay -to this value. Minimum interrupt delay will be a half of this value - -Note: ITR settings could be changed in runtime by ethtool -c means (see below) - -Config file parameters -======================= -For some fine tuning and performance optimizations, -some parameters can be changed in the {source_dir}/aq_cfg.h file. - -AQ_CFG_RX_PAGEORDER ----------------------------------------- -Default value: 0 -RX page order override. Thats a power of 2 number of RX pages allocated for -each descriptor. Received descriptor size is still limited by AQ_CFG_RX_FRAME_MAX. -Increasing pageorder makes page reuse better (actual on iommu enabled systems). - -AQ_CFG_RX_REFILL_THRES ----------------------------------------- -Default value: 32 -RX refill threshold. RX path will not refill freed descriptors until the -specified number of free descriptors is observed. Larger values may help -better page reuse but may lead to packet drops as well. - -AQ_CFG_VECS_DEF ------------------------------------------------------------- -Number of queues -Valid Range: 0 - 8 (up to AQ_CFG_VECS_MAX) -Default value: 8 -Notice this value will be capped by the number of cores available on the system. - -AQ_CFG_IS_RSS_DEF ------------------------------------------------------------- -Enable/disable Receive Side Scaling - -This feature allows the adapter to distribute receive processing -across multiple CPU-cores and to prevent from overloading a single CPU core. - -Valid values -0 - disabled -1 - enabled - -Default value: 1 - -AQ_CFG_NUM_RSS_QUEUES_DEF ------------------------------------------------------------- -Number of queues for Receive Side Scaling -Valid Range: 0 - 8 (up to AQ_CFG_VECS_DEF) - -Default value: AQ_CFG_VECS_DEF - -AQ_CFG_IS_LRO_DEF ------------------------------------------------------------- -Enable/disable Large Receive Offload - -This offload enables the adapter to coalesce multiple TCP segments and indicate -them as a single coalesced unit to the OS networking subsystem. -The system consumes less energy but it also introduces more latency in packets processing. - -Valid values -0 - disabled -1 - enabled - -Default value: 1 - -AQ_CFG_TX_CLEAN_BUDGET ----------------------------------------- -Maximum descriptors to cleanup on TX at once. -Default value: 256 - -After the aq_cfg.h file changed the driver must be rebuilt to take effect. - -Support -======= - -If an issue is identified with the released source code on the supported -kernel with a supported adapter, email the specific information related -to the issue to aqn_support@marvell.com - -License -======= - -aQuantia Corporation Network Driver -Copyright(c) 2014 - 2019 aQuantia Corporation. - -This program is free software; you can redistribute it and/or modify it -under the terms and conditions of the GNU General Public License, -version 2, as published by the Free Software Foundation. diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 019a0d2efe67..7dde314fc957 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -30,6 +30,7 @@ Contents: 3com/3c509 3com/vortex amazon/ena + aquantia/atlantic .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index 7b6c13cc832f..b5cfee17635e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1275,7 +1275,7 @@ L: netdev@vger.kernel.org S: Supported W: https://www.marvell.com/ Q: http://patchwork.ozlabs.org/project/netdev/list/ -F: Documentation/networking/device_drivers/aquantia/atlantic.txt +F: Documentation/networking/device_drivers/aquantia/atlantic.rst F: drivers/net/ethernet/aquantia/atlantic/ AQUANTIA ETHERNET DRIVER PTP SUBSYSTEM -- cgit v1.2.3 From c839ce557b35de084d06f91c4e37948bdcef9709 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:38 +0200 Subject: docs: networking: device drivers: convert chelsio/cxgb.txt to ReST - add SPDX header; - use copyright symbol; - adjust titles and chapters, adding proper markups; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - add notes markups; - mark tables as such; - mark lists as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/chelsio/cxgb.rst | 393 +++++++++++++++++++++ .../networking/device_drivers/chelsio/cxgb.txt | 352 ------------------ Documentation/networking/device_drivers/index.rst | 1 + drivers/net/ethernet/chelsio/Kconfig | 2 +- 4 files changed, 395 insertions(+), 353 deletions(-) create mode 100644 Documentation/networking/device_drivers/chelsio/cxgb.rst delete mode 100644 Documentation/networking/device_drivers/chelsio/cxgb.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/chelsio/cxgb.rst b/Documentation/networking/device_drivers/chelsio/cxgb.rst new file mode 100644 index 000000000000..435dce5fa2c7 --- /dev/null +++ b/Documentation/networking/device_drivers/chelsio/cxgb.rst @@ -0,0 +1,393 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +============================================= +Chelsio N210 10Gb Ethernet Network Controller +============================================= + +Driver Release Notes for Linux + +Version 2.1.1 + +June 20, 2005 + +.. Contents + + INTRODUCTION + FEATURES + PERFORMANCE + DRIVER MESSAGES + KNOWN ISSUES + SUPPORT + + +Introduction +============ + + This document describes the Linux driver for Chelsio 10Gb Ethernet Network + Controller. This driver supports the Chelsio N210 NIC and is backward + compatible with the Chelsio N110 model 10Gb NICs. + + +Features +======== + +Adaptive Interrupts (adaptive-rx) +--------------------------------- + + This feature provides an adaptive algorithm that adjusts the interrupt + coalescing parameters, allowing the driver to dynamically adapt the latency + settings to achieve the highest performance during various types of network + load. + + The interface used to control this feature is ethtool. Please see the + ethtool manpage for additional usage information. + + By default, adaptive-rx is disabled. + To enable adaptive-rx:: + + ethtool -C adaptive-rx on + + To disable adaptive-rx, use ethtool:: + + ethtool -C adaptive-rx off + + After disabling adaptive-rx, the timer latency value will be set to 50us. + You may set the timer latency after disabling adaptive-rx:: + + ethtool -C rx-usecs + + An example to set the timer latency value to 100us on eth0:: + + ethtool -C eth0 rx-usecs 100 + + You may also provide a timer latency value while disabling adaptive-rx:: + + ethtool -C adaptive-rx off rx-usecs + + If adaptive-rx is disabled and a timer latency value is specified, the timer + will be set to the specified value until changed by the user or until + adaptive-rx is enabled. + + To view the status of the adaptive-rx and timer latency values:: + + ethtool -c + + +TCP Segmentation Offloading (TSO) Support +----------------------------------------- + + This feature, also known as "large send", enables a system's protocol stack + to offload portions of outbound TCP processing to a network interface card + thereby reducing system CPU utilization and enhancing performance. + + The interface used to control this feature is ethtool version 1.8 or higher. + Please see the ethtool manpage for additional usage information. + + By default, TSO is enabled. + To disable TSO:: + + ethtool -K tso off + + To enable TSO:: + + ethtool -K tso on + + To view the status of TSO:: + + ethtool -k + + +Performance +=========== + + The following information is provided as an example of how to change system + parameters for "performance tuning" an what value to use. You may or may not + want to change these system parameters, depending on your server/workstation + application. Doing so is not warranted in any way by Chelsio Communications, + and is done at "YOUR OWN RISK". Chelsio will not be held responsible for loss + of data or damage to equipment. + + Your distribution may have a different way of doing things, or you may prefer + a different method. These commands are shown only to provide an example of + what to do and are by no means definitive. + + Making any of the following system changes will only last until you reboot + your system. You may want to write a script that runs at boot-up which + includes the optimal settings for your system. + + Setting PCI Latency Timer:: + + setpci -d 1425:: + +* 0x0c.l=0x0000F800 + + Disabling TCP timestamp:: + + sysctl -w net.ipv4.tcp_timestamps=0 + + Disabling SACK:: + + sysctl -w net.ipv4.tcp_sack=0 + + Setting large number of incoming connection requests:: + + sysctl -w net.ipv4.tcp_max_syn_backlog=3000 + + Setting maximum receive socket buffer size:: + + sysctl -w net.core.rmem_max=1024000 + + Setting maximum send socket buffer size:: + + sysctl -w net.core.wmem_max=1024000 + + Set smp_affinity (on a multiprocessor system) to a single CPU:: + + echo 1 > /proc/irq//smp_affinity + + Setting default receive socket buffer size:: + + sysctl -w net.core.rmem_default=524287 + + Setting default send socket buffer size:: + + sysctl -w net.core.wmem_default=524287 + + Setting maximum option memory buffers:: + + sysctl -w net.core.optmem_max=524287 + + Setting maximum backlog (# of unprocessed packets before kernel drops):: + + sysctl -w net.core.netdev_max_backlog=300000 + + Setting TCP read buffers (min/default/max):: + + sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000" + + Setting TCP write buffers (min/pressure/max):: + + sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000" + + Setting TCP buffer space (min/pressure/max):: + + sysctl -w net.ipv4.tcp_mem="10000000 10000000 10000000" + + TCP window size for single connections: + + The receive buffer (RX_WINDOW) size must be at least as large as the + Bandwidth-Delay Product of the communication link between the sender and + receiver. Due to the variations of RTT, you may want to increase the buffer + size up to 2 times the Bandwidth-Delay Product. Reference page 289 of + "TCP/IP Illustrated, Volume 1, The Protocols" by W. Richard Stevens. + + At 10Gb speeds, use the following formula:: + + RX_WINDOW >= 1.25MBytes * RTT(in milliseconds) + Example for RTT with 100us: RX_WINDOW = (1,250,000 * 0.1) = 125,000 + + RX_WINDOW sizes of 256KB - 512KB should be sufficient. + + Setting the min, max, and default receive buffer (RX_WINDOW) size:: + + sysctl -w net.ipv4.tcp_rmem=" " + + TCP window size for multiple connections: + The receive buffer (RX_WINDOW) size may be calculated the same as single + connections, but should be divided by the number of connections. The + smaller window prevents congestion and facilitates better pacing, + especially if/when MAC level flow control does not work well or when it is + not supported on the machine. Experimentation may be necessary to attain + the correct value. This method is provided as a starting point for the + correct receive buffer size. + + Setting the min, max, and default receive buffer (RX_WINDOW) size is + performed in the same manner as single connection. + + +Driver Messages +=============== + + The following messages are the most common messages logged by syslog. These + may be found in /var/log/messages. + + Driver up:: + + Chelsio Network Driver - version 2.1.1 + + NIC detected:: + + eth#: Chelsio N210 1x10GBaseX NIC (rev #), PCIX 133MHz/64-bit + + Link up:: + + eth#: link is up at 10 Gbps, full duplex + + Link down:: + + eth#: link is down + + +Known Issues +============ + + These issues have been identified during testing. The following information + is provided as a workaround to the problem. In some cases, this problem is + inherent to Linux or to a particular Linux Distribution and/or hardware + platform. + + 1. Large number of TCP retransmits on a multiprocessor (SMP) system. + + On a system with multiple CPUs, the interrupt (IRQ) for the network + controller may be bound to more than one CPU. This will cause TCP + retransmits if the packet data were to be split across different CPUs + and re-assembled in a different order than expected. + + To eliminate the TCP retransmits, set smp_affinity on the particular + interrupt to a single CPU. You can locate the interrupt (IRQ) used on + the N110/N210 by using ifconfig:: + + ifconfig | grep Interrupt + + Set the smp_affinity to a single CPU:: + + echo 1 > /proc/irq//smp_affinity + + It is highly suggested that you do not run the irqbalance daemon on your + system, as this will change any smp_affinity setting you have applied. + The irqbalance daemon runs on a 10 second interval and binds interrupts + to the least loaded CPU determined by the daemon. To disable this daemon:: + + chkconfig --level 2345 irqbalance off + + By default, some Linux distributions enable the kernel feature, + irqbalance, which performs the same function as the daemon. To disable + this feature, add the following line to your bootloader:: + + noirqbalance + + Example using the Grub bootloader:: + + title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp) + root (hd0,0) + kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance + initrd /initrd-2.4.21-27.ELsmp.img + + 2. After running insmod, the driver is loaded and the incorrect network + interface is brought up without running ifup. + + When using 2.4.x kernels, including RHEL kernels, the Linux kernel + invokes a script named "hotplug". This script is primarily used to + automatically bring up USB devices when they are plugged in, however, + the script also attempts to automatically bring up a network interface + after loading the kernel module. The hotplug script does this by scanning + the ifcfg-eth# config files in /etc/sysconfig/network-scripts, looking + for HWADDR=. + + If the hotplug script does not find the HWADDRR within any of the + ifcfg-eth# files, it will bring up the device with the next available + interface name. If this interface is already configured for a different + network card, your new interface will have incorrect IP address and + network settings. + + To solve this issue, you can add the HWADDR= key to the + interface config file of your network controller. + + To disable this "hotplug" feature, you may add the driver (module name) + to the "blacklist" file located in /etc/hotplug. It has been noted that + this does not work for network devices because the net.agent script + does not use the blacklist file. Simply remove, or rename, the net.agent + script located in /etc/hotplug to disable this feature. + + 3. Transport Protocol (TP) hangs when running heavy multi-connection traffic + on an AMD Opteron system with HyperTransport PCI-X Tunnel chipset. + + If your AMD Opteron system uses the AMD-8131 HyperTransport PCI-X Tunnel + chipset, you may experience the "133-Mhz Mode Split Completion Data + Corruption" bug identified by AMD while using a 133Mhz PCI-X card on the + bus PCI-X bus. + + AMD states, "Under highly specific conditions, the AMD-8131 PCI-X Tunnel + can provide stale data via split completion cycles to a PCI-X card that + is operating at 133 Mhz", causing data corruption. + + AMD's provides three workarounds for this problem, however, Chelsio + recommends the first option for best performance with this bug: + + For 133Mhz secondary bus operation, limit the transaction length and + the number of outstanding transactions, via BIOS configuration + programming of the PCI-X card, to the following: + + Data Length (bytes): 1k + + Total allowed outstanding transactions: 2 + + Please refer to AMD 8131-HT/PCI-X Errata 26310 Rev 3.08 August 2004, + section 56, "133-MHz Mode Split Completion Data Corruption" for more + details with this bug and workarounds suggested by AMD. + + It may be possible to work outside AMD's recommended PCI-X settings, try + increasing the Data Length to 2k bytes for increased performance. If you + have issues with these settings, please revert to the "safe" settings + and duplicate the problem before submitting a bug or asking for support. + + .. note:: + + The default setting on most systems is 8 outstanding transactions + and 2k bytes data length. + + 4. On multiprocessor systems, it has been noted that an application which + is handling 10Gb networking can switch between CPUs causing degraded + and/or unstable performance. + + If running on an SMP system and taking performance measurements, it + is suggested you either run the latest netperf-2.4.0+ or use a binding + tool such as Tim Hockin's procstate utilities (runon) + . + + Binding netserver and netperf (or other applications) to particular + CPUs will have a significant difference in performance measurements. + You may need to experiment which CPU to bind the application to in + order to achieve the best performance for your system. + + If you are developing an application designed for 10Gb networking, + please keep in mind you may want to look at kernel functions + sched_setaffinity & sched_getaffinity to bind your application. + + If you are just running user-space applications such as ftp, telnet, + etc., you may want to try the runon tool provided by Tim Hockin's + procstate utility. You could also try binding the interface to a + particular CPU: runon 0 ifup eth0 + + +Support +======= + + If you have problems with the software or hardware, please contact our + customer support team via email at support@chelsio.com or check our website + at http://www.chelsio.com + +------------------------------------------------------------------------------- + +:: + + Chelsio Communications + 370 San Aleso Ave. + Suite 100 + Sunnyvale, CA 94085 + http://www.chelsio.com + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License, version 2, as +published by the Free Software Foundation. + +You should have received a copy of the GNU General Public License along +with this program; if not, write to the Free Software Foundation, Inc., +59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + +THIS SOFTWARE IS PROVIDED ``AS IS`` AND WITHOUT ANY EXPRESS OR IMPLIED +WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. + +Copyright |copy| 2003-2005 Chelsio Communications. All rights reserved. diff --git a/Documentation/networking/device_drivers/chelsio/cxgb.txt b/Documentation/networking/device_drivers/chelsio/cxgb.txt deleted file mode 100644 index 20a887615c4a..000000000000 --- a/Documentation/networking/device_drivers/chelsio/cxgb.txt +++ /dev/null @@ -1,352 +0,0 @@ - Chelsio N210 10Gb Ethernet Network Controller - - Driver Release Notes for Linux - - Version 2.1.1 - - June 20, 2005 - -CONTENTS -======== - INTRODUCTION - FEATURES - PERFORMANCE - DRIVER MESSAGES - KNOWN ISSUES - SUPPORT - - -INTRODUCTION -============ - - This document describes the Linux driver for Chelsio 10Gb Ethernet Network - Controller. This driver supports the Chelsio N210 NIC and is backward - compatible with the Chelsio N110 model 10Gb NICs. - - -FEATURES -======== - - Adaptive Interrupts (adaptive-rx) - --------------------------------- - - This feature provides an adaptive algorithm that adjusts the interrupt - coalescing parameters, allowing the driver to dynamically adapt the latency - settings to achieve the highest performance during various types of network - load. - - The interface used to control this feature is ethtool. Please see the - ethtool manpage for additional usage information. - - By default, adaptive-rx is disabled. - To enable adaptive-rx: - - ethtool -C adaptive-rx on - - To disable adaptive-rx, use ethtool: - - ethtool -C adaptive-rx off - - After disabling adaptive-rx, the timer latency value will be set to 50us. - You may set the timer latency after disabling adaptive-rx: - - ethtool -C rx-usecs - - An example to set the timer latency value to 100us on eth0: - - ethtool -C eth0 rx-usecs 100 - - You may also provide a timer latency value while disabling adaptive-rx: - - ethtool -C adaptive-rx off rx-usecs - - If adaptive-rx is disabled and a timer latency value is specified, the timer - will be set to the specified value until changed by the user or until - adaptive-rx is enabled. - - To view the status of the adaptive-rx and timer latency values: - - ethtool -c - - - TCP Segmentation Offloading (TSO) Support - ----------------------------------------- - - This feature, also known as "large send", enables a system's protocol stack - to offload portions of outbound TCP processing to a network interface card - thereby reducing system CPU utilization and enhancing performance. - - The interface used to control this feature is ethtool version 1.8 or higher. - Please see the ethtool manpage for additional usage information. - - By default, TSO is enabled. - To disable TSO: - - ethtool -K tso off - - To enable TSO: - - ethtool -K tso on - - To view the status of TSO: - - ethtool -k - - -PERFORMANCE -=========== - - The following information is provided as an example of how to change system - parameters for "performance tuning" an what value to use. You may or may not - want to change these system parameters, depending on your server/workstation - application. Doing so is not warranted in any way by Chelsio Communications, - and is done at "YOUR OWN RISK". Chelsio will not be held responsible for loss - of data or damage to equipment. - - Your distribution may have a different way of doing things, or you may prefer - a different method. These commands are shown only to provide an example of - what to do and are by no means definitive. - - Making any of the following system changes will only last until you reboot - your system. You may want to write a script that runs at boot-up which - includes the optimal settings for your system. - - Setting PCI Latency Timer: - setpci -d 1425:* 0x0c.l=0x0000F800 - - Disabling TCP timestamp: - sysctl -w net.ipv4.tcp_timestamps=0 - - Disabling SACK: - sysctl -w net.ipv4.tcp_sack=0 - - Setting large number of incoming connection requests: - sysctl -w net.ipv4.tcp_max_syn_backlog=3000 - - Setting maximum receive socket buffer size: - sysctl -w net.core.rmem_max=1024000 - - Setting maximum send socket buffer size: - sysctl -w net.core.wmem_max=1024000 - - Set smp_affinity (on a multiprocessor system) to a single CPU: - echo 1 > /proc/irq//smp_affinity - - Setting default receive socket buffer size: - sysctl -w net.core.rmem_default=524287 - - Setting default send socket buffer size: - sysctl -w net.core.wmem_default=524287 - - Setting maximum option memory buffers: - sysctl -w net.core.optmem_max=524287 - - Setting maximum backlog (# of unprocessed packets before kernel drops): - sysctl -w net.core.netdev_max_backlog=300000 - - Setting TCP read buffers (min/default/max): - sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000" - - Setting TCP write buffers (min/pressure/max): - sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000" - - Setting TCP buffer space (min/pressure/max): - sysctl -w net.ipv4.tcp_mem="10000000 10000000 10000000" - - TCP window size for single connections: - The receive buffer (RX_WINDOW) size must be at least as large as the - Bandwidth-Delay Product of the communication link between the sender and - receiver. Due to the variations of RTT, you may want to increase the buffer - size up to 2 times the Bandwidth-Delay Product. Reference page 289 of - "TCP/IP Illustrated, Volume 1, The Protocols" by W. Richard Stevens. - At 10Gb speeds, use the following formula: - RX_WINDOW >= 1.25MBytes * RTT(in milliseconds) - Example for RTT with 100us: RX_WINDOW = (1,250,000 * 0.1) = 125,000 - RX_WINDOW sizes of 256KB - 512KB should be sufficient. - Setting the min, max, and default receive buffer (RX_WINDOW) size: - sysctl -w net.ipv4.tcp_rmem=" " - - TCP window size for multiple connections: - The receive buffer (RX_WINDOW) size may be calculated the same as single - connections, but should be divided by the number of connections. The - smaller window prevents congestion and facilitates better pacing, - especially if/when MAC level flow control does not work well or when it is - not supported on the machine. Experimentation may be necessary to attain - the correct value. This method is provided as a starting point for the - correct receive buffer size. - Setting the min, max, and default receive buffer (RX_WINDOW) size is - performed in the same manner as single connection. - - -DRIVER MESSAGES -=============== - - The following messages are the most common messages logged by syslog. These - may be found in /var/log/messages. - - Driver up: - Chelsio Network Driver - version 2.1.1 - - NIC detected: - eth#: Chelsio N210 1x10GBaseX NIC (rev #), PCIX 133MHz/64-bit - - Link up: - eth#: link is up at 10 Gbps, full duplex - - Link down: - eth#: link is down - - -KNOWN ISSUES -============ - - These issues have been identified during testing. The following information - is provided as a workaround to the problem. In some cases, this problem is - inherent to Linux or to a particular Linux Distribution and/or hardware - platform. - - 1. Large number of TCP retransmits on a multiprocessor (SMP) system. - - On a system with multiple CPUs, the interrupt (IRQ) for the network - controller may be bound to more than one CPU. This will cause TCP - retransmits if the packet data were to be split across different CPUs - and re-assembled in a different order than expected. - - To eliminate the TCP retransmits, set smp_affinity on the particular - interrupt to a single CPU. You can locate the interrupt (IRQ) used on - the N110/N210 by using ifconfig: - ifconfig | grep Interrupt - Set the smp_affinity to a single CPU: - echo 1 > /proc/irq//smp_affinity - - It is highly suggested that you do not run the irqbalance daemon on your - system, as this will change any smp_affinity setting you have applied. - The irqbalance daemon runs on a 10 second interval and binds interrupts - to the least loaded CPU determined by the daemon. To disable this daemon: - chkconfig --level 2345 irqbalance off - - By default, some Linux distributions enable the kernel feature, - irqbalance, which performs the same function as the daemon. To disable - this feature, add the following line to your bootloader: - noirqbalance - - Example using the Grub bootloader: - title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp) - root (hd0,0) - kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance - initrd /initrd-2.4.21-27.ELsmp.img - - 2. After running insmod, the driver is loaded and the incorrect network - interface is brought up without running ifup. - - When using 2.4.x kernels, including RHEL kernels, the Linux kernel - invokes a script named "hotplug". This script is primarily used to - automatically bring up USB devices when they are plugged in, however, - the script also attempts to automatically bring up a network interface - after loading the kernel module. The hotplug script does this by scanning - the ifcfg-eth# config files in /etc/sysconfig/network-scripts, looking - for HWADDR=. - - If the hotplug script does not find the HWADDRR within any of the - ifcfg-eth# files, it will bring up the device with the next available - interface name. If this interface is already configured for a different - network card, your new interface will have incorrect IP address and - network settings. - - To solve this issue, you can add the HWADDR= key to the - interface config file of your network controller. - - To disable this "hotplug" feature, you may add the driver (module name) - to the "blacklist" file located in /etc/hotplug. It has been noted that - this does not work for network devices because the net.agent script - does not use the blacklist file. Simply remove, or rename, the net.agent - script located in /etc/hotplug to disable this feature. - - 3. Transport Protocol (TP) hangs when running heavy multi-connection traffic - on an AMD Opteron system with HyperTransport PCI-X Tunnel chipset. - - If your AMD Opteron system uses the AMD-8131 HyperTransport PCI-X Tunnel - chipset, you may experience the "133-Mhz Mode Split Completion Data - Corruption" bug identified by AMD while using a 133Mhz PCI-X card on the - bus PCI-X bus. - - AMD states, "Under highly specific conditions, the AMD-8131 PCI-X Tunnel - can provide stale data via split completion cycles to a PCI-X card that - is operating at 133 Mhz", causing data corruption. - - AMD's provides three workarounds for this problem, however, Chelsio - recommends the first option for best performance with this bug: - - For 133Mhz secondary bus operation, limit the transaction length and - the number of outstanding transactions, via BIOS configuration - programming of the PCI-X card, to the following: - - Data Length (bytes): 1k - Total allowed outstanding transactions: 2 - - Please refer to AMD 8131-HT/PCI-X Errata 26310 Rev 3.08 August 2004, - section 56, "133-MHz Mode Split Completion Data Corruption" for more - details with this bug and workarounds suggested by AMD. - - It may be possible to work outside AMD's recommended PCI-X settings, try - increasing the Data Length to 2k bytes for increased performance. If you - have issues with these settings, please revert to the "safe" settings - and duplicate the problem before submitting a bug or asking for support. - - NOTE: The default setting on most systems is 8 outstanding transactions - and 2k bytes data length. - - 4. On multiprocessor systems, it has been noted that an application which - is handling 10Gb networking can switch between CPUs causing degraded - and/or unstable performance. - - If running on an SMP system and taking performance measurements, it - is suggested you either run the latest netperf-2.4.0+ or use a binding - tool such as Tim Hockin's procstate utilities (runon) - . - - Binding netserver and netperf (or other applications) to particular - CPUs will have a significant difference in performance measurements. - You may need to experiment which CPU to bind the application to in - order to achieve the best performance for your system. - - If you are developing an application designed for 10Gb networking, - please keep in mind you may want to look at kernel functions - sched_setaffinity & sched_getaffinity to bind your application. - - If you are just running user-space applications such as ftp, telnet, - etc., you may want to try the runon tool provided by Tim Hockin's - procstate utility. You could also try binding the interface to a - particular CPU: runon 0 ifup eth0 - - -SUPPORT -======= - - If you have problems with the software or hardware, please contact our - customer support team via email at support@chelsio.com or check our website - at http://www.chelsio.com - -=============================================================================== - - Chelsio Communications - 370 San Aleso Ave. - Suite 100 - Sunnyvale, CA 94085 - http://www.chelsio.com - -This program is free software; you can redistribute it and/or modify -it under the terms of the GNU General Public License, version 2, as -published by the Free Software Foundation. - -You should have received a copy of the GNU General Public License along -with this program; if not, write to the Free Software Foundation, Inc., -59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. - -THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED -WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF -MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. - - Copyright (c) 2003-2005 Chelsio Communications. All rights reserved. - -=============================================================================== diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 7dde314fc957..23c4ec9c9125 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -31,6 +31,7 @@ Contents: 3com/vortex amazon/ena aquantia/atlantic + chelsio/cxgb .. only:: subproject and html diff --git a/drivers/net/ethernet/chelsio/Kconfig b/drivers/net/ethernet/chelsio/Kconfig index 9909bfda167e..82cdfa51ce37 100644 --- a/drivers/net/ethernet/chelsio/Kconfig +++ b/drivers/net/ethernet/chelsio/Kconfig @@ -26,7 +26,7 @@ config CHELSIO_T1 This driver supports Chelsio gigabit and 10-gigabit Ethernet cards. More information about adapter features and performance tuning is in - . + . For general information about Chelsio and our products, visit our website at . -- cgit v1.2.3 From 714a4da450c03bdc53a0d5fa6a4b3192b30c5cda Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:39 +0200 Subject: docs: networking: device drivers: convert cirrus/cs89x0.txt to ReST - add SPDX header; - adjust title markup; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/cirrus/cs89x0.rst | 647 +++++++++++++++++++++ .../networking/device_drivers/cirrus/cs89x0.txt | 624 -------------------- Documentation/networking/device_drivers/index.rst | 1 + drivers/net/ethernet/cirrus/Kconfig | 2 +- 4 files changed, 649 insertions(+), 625 deletions(-) create mode 100644 Documentation/networking/device_drivers/cirrus/cs89x0.rst delete mode 100644 Documentation/networking/device_drivers/cirrus/cs89x0.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/cirrus/cs89x0.rst b/Documentation/networking/device_drivers/cirrus/cs89x0.rst new file mode 100644 index 000000000000..e5c283940ac5 --- /dev/null +++ b/Documentation/networking/device_drivers/cirrus/cs89x0.rst @@ -0,0 +1,647 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================ +Cirrus Logic LAN CS8900/CS8920 Ethernet Adapters +================================================ + +.. note:: + + This document was contributed by Cirrus Logic for kernel 2.2.5. This version + has been updated for 2.3.48 by Andrew Morton. + + Still, this is too outdated! A major cleanup is needed here. + +Cirrus make a copy of this driver available at their website, as +described below. In general, you should use the driver version which +comes with your Linux distribution. + + +Linux Network Interface Driver ver. 2.00 + + +.. TABLE OF CONTENTS + + 1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS + 1.1 Product Overview + 1.2 Driver Description + 1.2.1 Driver Name + 1.2.2 File in the Driver Package + 1.3 System Requirements + 1.4 Licensing Information + + 2.0 ADAPTER INSTALLATION and CONFIGURATION + 2.1 CS8900-based Adapter Configuration + 2.2 CS8920-based Adapter Configuration + + 3.0 LOADING THE DRIVER AS A MODULE + + 4.0 COMPILING THE DRIVER + 4.1 Compiling the Driver as a Loadable Module + 4.2 Compiling the driver to support memory mode + 4.3 Compiling the driver to support Rx DMA + + 5.0 TESTING AND TROUBLESHOOTING + 5.1 Known Defects and Limitations + 5.2 Testing the Adapter + 5.2.1 Diagnostic Self-Test + 5.2.2 Diagnostic Network Test + 5.3 Using the Adapter's LEDs + 5.4 Resolving I/O Conflicts + + 6.0 TECHNICAL SUPPORT + 6.1 Contacting Cirrus Logic's Technical Support + 6.2 Information Required Before Contacting Technical Support + 6.3 Obtaining the Latest Driver Version + 6.4 Current maintainer + 6.5 Kernel boot parameters + + +1. Cirrus Logic LAN CS8900/CS8920 Ethernet Adapters +=================================================== + + +1.1. Product Overview +===================== + +The CS8900-based ISA Ethernet Adapters from Cirrus Logic follow +IEEE 802.3 standards and support half or full-duplex operation in ISA bus +computers on 10 Mbps Ethernet networks. The adapters are designed for operation +in 16-bit ISA or EISA bus expansion slots and are available in +10BaseT-only or 3-media configurations (10BaseT, 10Base2, and AUI for 10Base-5 +or fiber networks). + +CS8920-based adapters are similar to the CS8900-based adapter with additional +features for Plug and Play (PnP) support and Wakeup Frame recognition. As +such, the configuration procedures differ somewhat between the two types of +adapters. Refer to the "Adapter Configuration" section for details on +configuring both types of adapters. + + +1.2. Driver Description +======================= + +The CS8900/CS8920 Ethernet Adapter driver for Linux supports the Linux +v2.3.48 or greater kernel. It can be compiled directly into the kernel +or loaded at run-time as a device driver module. + +1.2.1 Driver Name: cs89x0 + +1.2.2 Files in the Driver Archive: + +The files in the driver at Cirrus' website include: + + =================== ==================================================== + readme.txt this file + build batch file to compile cs89x0.c. + cs89x0.c driver C code + cs89x0.h driver header file + cs89x0.o pre-compiled module (for v2.2.5 kernel) + config/Config.in sample file to include cs89x0 driver in the kernel. + config/Makefile sample file to include cs89x0 driver in the kernel. + config/Space.c sample file to include cs89x0 driver in the kernel. + =================== ==================================================== + + + +1.3. System Requirements +------------------------ + +The following hardware is required: + + * Cirrus Logic LAN (CS8900/20-based) Ethernet ISA Adapter + + * IBM or IBM-compatible PC with: + * An 80386 or higher processor + * 16 bytes of contiguous IO space available between 210h - 370h + * One available IRQ (5,10,11,or 12 for the CS8900, 3-7,9-15 for CS8920). + + * Appropriate cable (and connector for AUI, 10BASE-2) for your network + topology. + +The following software is required: + +* LINUX kernel version 2.3.48 or higher + + * CS8900/20 Setup Utility (DOS-based) + + * LINUX kernel sources for your kernel (if compiling into kernel) + + * GNU Toolkit (gcc and make) v2.6 or above (if compiling into kernel + or a module) + + + +1.4. Licensing Information +-------------------------- + +This program is free software; you can redistribute it and/or modify it under +the terms of the GNU General Public License as published by the Free Software +Foundation, version 1. + +This program is distributed in the hope that it will be useful, but WITHOUT +ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for +more details. + +For a full copy of the GNU General Public License, write to the Free Software +Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + + + +2. Adapter Installation and Configuration +========================================= + +Both the CS8900 and CS8920-based adapters can be configured using parameters +stored in an on-board EEPROM. You must use the DOS-based CS8900/20 Setup +Utility if you want to change the adapter's configuration in EEPROM. + +When loading the driver as a module, you can specify many of the adapter's +configuration parameters on the command-line to override the EEPROM's settings +or for interface configuration when an EEPROM is not used. (CS8920-based +adapters must use an EEPROM.) See Section 3.0 LOADING THE DRIVER AS A MODULE. + +Since the CS8900/20 Setup Utility is a DOS-based application, you must install +and configure the adapter in a DOS-based system using the CS8900/20 Setup +Utility before installation in the target LINUX system. (Not required if +installing a CS8900-based adapter and the default configuration is acceptable.) + + +2.1. CS8900-based Adapter Configuration +--------------------------------------- + +CS8900-based adapters shipped from Cirrus Logic have been configured +with the following "default" settings:: + + Operation Mode: Memory Mode + IRQ: 10 + Base I/O Address: 300 + Memory Base Address: D0000 + Optimization: DOS Client + Transmission Mode: Half-duplex + BootProm: None + Media Type: Autodetect (3-media cards) or + 10BASE-T (10BASE-T only adapter) + +You should only change the default configuration settings if conflicts with +another adapter exists. To change the adapter's configuration, run the +CS8900/20 Setup Utility. + + +2.2. CS8920-based Adapter Configuration +--------------------------------------- + +CS8920-based adapters are shipped from Cirrus Logic configured as Plug +and Play (PnP) enabled. However, since the cs89x0 driver does NOT +support PnP, you must install the CS8920 adapter in a DOS-based PC and +run the CS8900/20 Setup Utility to disable PnP and configure the +adapter before installation in the target Linux system. Failure to do +this will leave the adapter inactive and the driver will be unable to +communicate with the adapter. + +:: + + **************************************************************** + * CS8920-BASED ADAPTERS: * + * * + * CS8920-BASED ADAPTERS ARE PLUG and PLAY ENABLED BY DEFAULT. * + * THE CS89X0 DRIVER DOES NOT SUPPORT PnP. THEREFORE, YOU MUST * + * RUN THE CS8900/20 SETUP UTILITY TO DISABLE PnP SUPPORT AND * + * TO ACTIVATE THE ADAPTER. * + **************************************************************** + + + + +3. Loading the Driver as a Module +================================= + +If the driver is compiled as a loadable module, you can load the driver module +with the 'modprobe' command. Many of the adapter's configuration parameters can +be specified as command-line arguments to the load command. This facility +provides a means to override the EEPROM's settings or for interface +configuration when an EEPROM is not used. + +Example:: + + insmod cs89x0.o io=0x200 irq=0xA media=aui + +This example loads the module and configures the adapter to use an IO port base +address of 200h, interrupt 10, and use the AUI media connection. The following +configuration options are available on the command line:: + + io=### - specify IO address (200h-360h) + irq=## - specify interrupt level + use_dma=1 - Enable DMA + dma=# - specify dma channel (Driver is compiled to support + Rx DMA only) + dmasize=# (16 or 64) - DMA size 16K or 64K. Default value is set to 16. + media=rj45 - specify media type + or media=bnc + or media=aui + or media=auto + duplex=full - specify forced half/full/autonegotiate duplex + or duplex=half + or duplex=auto + debug=# - debug level (only available if the driver was compiled + for debugging) + +**Notes:** + +a) If an EEPROM is present, any specified command-line parameter + will override the corresponding configuration value stored in + EEPROM. + +b) The "io" parameter must be specified on the command-line. + +c) The driver's hardware probe routine is designed to avoid + writing to I/O space until it knows that there is a cs89x0 + card at the written addresses. This could cause problems + with device probing. To avoid this behaviour, add one + to the ``io=`` module parameter. This doesn't actually change + the I/O address, but it is a flag to tell the driver + to partially initialise the hardware before trying to + identify the card. This could be dangerous if you are + not sure that there is a cs89x0 card at the provided address. + + For example, to scan for an adapter located at IO base 0x300, + specify an IO address of 0x301. + +d) The "duplex=auto" parameter is only supported for the CS8920. + +e) The minimum command-line configuration required if an EEPROM is + not present is: + + io + irq + media type (no autodetect) + +f) The following additional parameters are CS89XX defaults (values + used with no EEPROM or command-line argument). + + * DMA Burst = enabled + * IOCHRDY Enabled = enabled + * UseSA = enabled + * CS8900 defaults to half-duplex if not specified on command-line + * CS8920 defaults to autoneg if not specified on command-line + * Use reset defaults for other config parameters + * dma_mode = 0 + +g) You can use ifconfig to set the adapter's Ethernet address. + +h) Many Linux distributions use the 'modprobe' command to load + modules. This program uses the '/etc/conf.modules' file to + determine configuration information which is passed to a driver + module when it is loaded. All the configuration options which are + described above may be placed within /etc/conf.modules. + + For example:: + + > cat /etc/conf.modules + ... + alias eth0 cs89x0 + options cs89x0 io=0x0200 dma=5 use_dma=1 + ... + + In this example we are telling the module system that the + ethernet driver for this machine should use the cs89x0 driver. We + are asking 'modprobe' to pass the 'io', 'dma' and 'use_dma' + arguments to the driver when it is loaded. + +i) Cirrus recommend that the cs89x0 use the ISA DMA channels 5, 6 or + 7. You will probably find that other DMA channels will not work. + +j) The cs89x0 supports DMA for receiving only. DMA mode is + significantly more efficient. Flooding a 400 MHz Celeron machine + with large ping packets consumes 82% of its CPU capacity in non-DMA + mode. With DMA this is reduced to 45%. + +k) If your Linux kernel was compiled with inbuilt plug-and-play + support you will be able to find information about the cs89x0 card + with the command:: + + cat /proc/isapnp + +l) If during DMA operation you find erratic behavior or network data + corruption you should use your PC's BIOS to slow the EISA bus clock. + +m) If the cs89x0 driver is compiled directly into the kernel + (non-modular) then its I/O address is automatically determined by + ISA bus probing. The IRQ number, media options, etc are determined + from the card's EEPROM. + +n) If the cs89x0 driver is compiled directly into the kernel, DMA + mode may be selected by providing the kernel with a boot option + 'cs89x0_dma=N' where 'N' is the desired DMA channel number (5, 6 or 7). + + Kernel boot options may be provided on the LILO command line:: + + LILO boot: linux cs89x0_dma=5 + + or they may be placed in /etc/lilo.conf:: + + image=/boot/bzImage-2.3.48 + append="cs89x0_dma=5" + label=linux + root=/dev/hda5 + read-only + + The DMA Rx buffer size is hardwired to 16 kbytes in this mode. + (64k mode is not available). + + +4. Compiling the Driver +======================= + +The cs89x0 driver can be compiled directly into the kernel or compiled into +a loadable device driver module. + +Just use the standard way to configure the driver and compile the Kernel. + + +4.1. Compiling the Driver to Support Rx DMA +------------------------------------------- + +The compile-time optionality for DMA was removed in the 2.3 kernel +series. DMA support is now unconditionally part of the driver. It is +enabled by the 'use_dma=1' module option. + + +5. Testing and Troubleshooting +============================== + +5.1. Known Defects and Limitations +---------------------------------- + +Refer to the RELEASE.TXT file distributed as part of this archive for a list of +known defects, driver limitations, and work arounds. + + +5.2. Testing the Adapter +------------------------ + +Once the adapter has been installed and configured, the diagnostic option of +the CS8900/20 Setup Utility can be used to test the functionality of the +adapter and its network connection. Use the diagnostics 'Self Test' option to +test the functionality of the adapter with the hardware configuration you have +assigned. You can use the diagnostics 'Network Test' to test the ability of the +adapter to communicate across the Ethernet with another PC equipped with a +CS8900/20-based adapter card (it must also be running the CS8900/20 Setup +Utility). + +.. note:: + + The Setup Utility's diagnostics are designed to run in a + DOS-only operating system environment. DO NOT run the diagnostics + from a DOS or command prompt session under Windows 95, Windows NT, + OS/2, or other operating system. + +To run the diagnostics tests on the CS8900/20 adapter: + + 1. Boot DOS on the PC and start the CS8900/20 Setup Utility. + + 2. The adapter's current configuration is displayed. Hit the ENTER key to + get to the main menu. + + 4. Select 'Diagnostics' (ALT-G) from the main menu. + * Select 'Self-Test' to test the adapter's basic functionality. + * Select 'Network Test' to test the network connection and cabling. + + +5.2.1. Diagnostic Self-test +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The diagnostic self-test checks the adapter's basic functionality as well as +its ability to communicate across the ISA bus based on the system resources +assigned during hardware configuration. The following tests are performed: + + * IO Register Read/Write Test + + The IO Register Read/Write test insures that the CS8900/20 can be + accessed in IO mode, and that the IO base address is correct. + + * Shared Memory Test + + The Shared Memory test insures the CS8900/20 can be accessed in memory + mode and that the range of memory addresses assigned does not conflict + with other devices in the system. + + * Interrupt Test + + The Interrupt test insures there are no conflicts with the assigned IRQ + signal. + + * EEPROM Test + + The EEPROM test insures the EEPROM can be read. + + * Chip RAM Test + + The Chip RAM test insures the 4K of memory internal to the CS8900/20 is + working properly. + + * Internal Loop-back Test + + The Internal Loop Back test insures the adapter's transmitter and + receiver are operating properly. If this test fails, make sure the + adapter's cable is connected to the network (check for LED activity for + example). + + * Boot PROM Test + + The Boot PROM test insures the Boot PROM is present, and can be read. + Failure indicates the Boot PROM was not successfully read due to a + hardware problem or due to a conflicts on the Boot PROM address + assignment. (Test only applies if the adapter is configured to use the + Boot PROM option.) + +Failure of a test item indicates a possible system resource conflict with +another device on the ISA bus. In this case, you should use the Manual Setup +option to reconfigure the adapter by selecting a different value for the system +resource that failed. + + +5.2.2. Diagnostic Network Test +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The Diagnostic Network Test verifies a working network connection by +transferring data between two CS8900/20 adapters installed in different PCs +on the same network. (Note: the diagnostic network test should not be run +between two nodes across a router.) + +This test requires that each of the two PCs have a CS8900/20-based adapter +installed and have the CS8900/20 Setup Utility running. The first PC is +configured as a Responder and the other PC is configured as an Initiator. +Once the Initiator is started, it sends data frames to the Responder which +returns the frames to the Initiator. + +The total number of frames received and transmitted are displayed on the +Initiator's display, along with a count of the number of frames received and +transmitted OK or in error. The test can be terminated anytime by the user at +either PC. + +To setup the Diagnostic Network Test: + + 1. Select a PC with a CS8900/20-based adapter and a known working network + connection to act as the Responder. Run the CS8900/20 Setup Utility + and select 'Diagnostics -> Network Test -> Responder' from the main + menu. Hit ENTER to start the Responder. + + 2. Return to the PC with the CS8900/20-based adapter you want to test and + start the CS8900/20 Setup Utility. + + 3. From the main menu, Select 'Diagnostic -> Network Test -> Initiator'. + Hit ENTER to start the test. + +You may stop the test on the Initiator at any time while allowing the Responder +to continue running. In this manner, you can move to additional PCs and test +them by starting the Initiator on another PC without having to stop/start the +Responder. + + + +5.3. Using the Adapter's LEDs +----------------------------- + +The 2 and 3-media adapters have two LEDs visible on the back end of the board +located near the 10Base-T connector. + +Link Integrity LED: A "steady" ON of the green LED indicates a valid 10Base-T +connection. (Only applies to 10Base-T. The green LED has no significance for +a 10Base-2 or AUI connection.) + +TX/RX LED: The yellow LED lights briefly each time the adapter transmits or +receives data. (The yellow LED will appear to "flicker" on a typical network.) + + +5.4. Resolving I/O Conflicts +---------------------------- + +An IO conflict occurs when two or more adapter use the same ISA resource (IO +address, memory address or IRQ). You can usually detect an IO conflict in one +of four ways after installing and or configuring the CS8900/20-based adapter: + + 1. The system does not boot properly (or at all). + + 2. The driver cannot communicate with the adapter, reporting an "Adapter + not found" error message. + + 3. You cannot connect to the network or the driver will not load. + + 4. If you have configured the adapter to run in memory mode but the driver + reports it is using IO mode when loading, this is an indication of a + memory address conflict. + +If an IO conflict occurs, run the CS8900/20 Setup Utility and perform a +diagnostic self-test. Normally, the ISA resource in conflict will fail the +self-test. If so, reconfigure the adapter selecting another choice for the +resource in conflict. Run the diagnostics again to check for further IO +conflicts. + +In some cases, such as when the PC will not boot, it may be necessary to remove +the adapter and reconfigure it by installing it in another PC to run the +CS8900/20 Setup Utility. Once reinstalled in the target system, run the +diagnostics self-test to ensure the new configuration is free of conflicts +before loading the driver again. + +When manually configuring the adapter, keep in mind the typical ISA system +resource usage as indicated in the tables below. + +:: + + I/O Address Device IRQ Device + ----------- -------- --- -------- + 200-20F Game I/O adapter 3 COM2, Bus Mouse + 230-23F Bus Mouse 4 COM1 + 270-27F LPT3: third parallel port 5 LPT2 + 2F0-2FF COM2: second serial port 6 Floppy Disk controller + 320-32F Fixed disk controller 7 LPT1 + 8 Real-time Clock + 9 EGA/VGA display adapter + 12 Mouse (PS/2) + Memory Address Device 13 Math Coprocessor + -------------- --------------------- 14 Hard Disk controller + A000-BFFF EGA Graphics Adapter + A000-C7FF VGA Graphics Adapter + B000-BFFF Mono Graphics Adapter + B800-BFFF Color Graphics Adapter + E000-FFFF AT BIOS + + + + +6. Technical Support +==================== + +6.1. Contacting Cirrus Logic's Technical Support +------------------------------------------------ + +Cirrus Logic's CS89XX Technical Application Support can be reached at:: + + Telephone :(800) 888-5016 (from inside U.S. and Canada) + :(512) 442-7555 (from outside the U.S. and Canada) + Fax :(512) 912-3871 + Email :ethernet@crystal.cirrus.com + WWW :http://www.cirrus.com + + +6.2. Information Required before Contacting Technical Support +------------------------------------------------------------- + +Before contacting Cirrus Logic for technical support, be prepared to provide as +Much of the following information as possible. + +1.) Adapter type (CRD8900, CDB8900, CDB8920, etc.) + +2.) Adapter configuration + + * IO Base, Memory Base, IO or memory mode enabled, IRQ, DMA channel + * Plug and Play enabled/disabled (CS8920-based adapters only) + * Configured for media auto-detect or specific media type (which type). + +3.) PC System's Configuration + + * Plug and Play system (yes/no) + * BIOS (make and version) + * System make and model + * CPU (type and speed) + * System RAM + * SCSI Adapter + +4.) Software + + * CS89XX driver and version + * Your network operating system and version + * Your system's OS version + * Version of all protocol support files + +5.) Any Error Message displayed. + + + +6.3 Obtaining the Latest Driver Version +--------------------------------------- + +You can obtain the latest CS89XX drivers and support software from Cirrus Logic's +Web site. You can also contact Cirrus Logic's Technical Support (email: +ethernet@crystal.cirrus.com) and request that you be registered for automatic +software-update notification. + +Cirrus Logic maintains a web page at http://www.cirrus.com with the +latest drivers and technical publications. + + +6.4. Current maintainer +----------------------- + +In February 2000 the maintenance of this driver was assumed by Andrew +Morton. + +6.5 Kernel module parameters +---------------------------- + +For use in embedded environments with no cs89x0 EEPROM, the kernel boot +parameter ``cs89x0_media=`` has been implemented. Usage is:: + + cs89x0_media=rj45 or + cs89x0_media=aui or + cs89x0_media=bnc diff --git a/Documentation/networking/device_drivers/cirrus/cs89x0.txt b/Documentation/networking/device_drivers/cirrus/cs89x0.txt deleted file mode 100644 index 0e190180eec8..000000000000 --- a/Documentation/networking/device_drivers/cirrus/cs89x0.txt +++ /dev/null @@ -1,624 +0,0 @@ - -NOTE ----- - -This document was contributed by Cirrus Logic for kernel 2.2.5. This version -has been updated for 2.3.48 by Andrew Morton. - -Cirrus make a copy of this driver available at their website, as -described below. In general, you should use the driver version which -comes with your Linux distribution. - - - -CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS -Linux Network Interface Driver ver. 2.00 -=============================================================================== - - -TABLE OF CONTENTS - -1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS - 1.1 Product Overview - 1.2 Driver Description - 1.2.1 Driver Name - 1.2.2 File in the Driver Package - 1.3 System Requirements - 1.4 Licensing Information - -2.0 ADAPTER INSTALLATION and CONFIGURATION - 2.1 CS8900-based Adapter Configuration - 2.2 CS8920-based Adapter Configuration - -3.0 LOADING THE DRIVER AS A MODULE - -4.0 COMPILING THE DRIVER - 4.1 Compiling the Driver as a Loadable Module - 4.2 Compiling the driver to support memory mode - 4.3 Compiling the driver to support Rx DMA - -5.0 TESTING AND TROUBLESHOOTING - 5.1 Known Defects and Limitations - 5.2 Testing the Adapter - 5.2.1 Diagnostic Self-Test - 5.2.2 Diagnostic Network Test - 5.3 Using the Adapter's LEDs - 5.4 Resolving I/O Conflicts - -6.0 TECHNICAL SUPPORT - 6.1 Contacting Cirrus Logic's Technical Support - 6.2 Information Required Before Contacting Technical Support - 6.3 Obtaining the Latest Driver Version - 6.4 Current maintainer - 6.5 Kernel boot parameters - - -1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS -=============================================================================== - - -1.1 PRODUCT OVERVIEW - -The CS8900-based ISA Ethernet Adapters from Cirrus Logic follow -IEEE 802.3 standards and support half or full-duplex operation in ISA bus -computers on 10 Mbps Ethernet networks. The adapters are designed for operation -in 16-bit ISA or EISA bus expansion slots and are available in -10BaseT-only or 3-media configurations (10BaseT, 10Base2, and AUI for 10Base-5 -or fiber networks). - -CS8920-based adapters are similar to the CS8900-based adapter with additional -features for Plug and Play (PnP) support and Wakeup Frame recognition. As -such, the configuration procedures differ somewhat between the two types of -adapters. Refer to the "Adapter Configuration" section for details on -configuring both types of adapters. - - -1.2 DRIVER DESCRIPTION - -The CS8900/CS8920 Ethernet Adapter driver for Linux supports the Linux -v2.3.48 or greater kernel. It can be compiled directly into the kernel -or loaded at run-time as a device driver module. - -1.2.1 Driver Name: cs89x0 - -1.2.2 Files in the Driver Archive: - -The files in the driver at Cirrus' website include: - - readme.txt - this file - build - batch file to compile cs89x0.c. - cs89x0.c - driver C code - cs89x0.h - driver header file - cs89x0.o - pre-compiled module (for v2.2.5 kernel) - config/Config.in - sample file to include cs89x0 driver in the kernel. - config/Makefile - sample file to include cs89x0 driver in the kernel. - config/Space.c - sample file to include cs89x0 driver in the kernel. - - - -1.3 SYSTEM REQUIREMENTS - -The following hardware is required: - - * Cirrus Logic LAN (CS8900/20-based) Ethernet ISA Adapter - - * IBM or IBM-compatible PC with: - * An 80386 or higher processor - * 16 bytes of contiguous IO space available between 210h - 370h - * One available IRQ (5,10,11,or 12 for the CS8900, 3-7,9-15 for CS8920). - - * Appropriate cable (and connector for AUI, 10BASE-2) for your network - topology. - -The following software is required: - -* LINUX kernel version 2.3.48 or higher - - * CS8900/20 Setup Utility (DOS-based) - - * LINUX kernel sources for your kernel (if compiling into kernel) - - * GNU Toolkit (gcc and make) v2.6 or above (if compiling into kernel - or a module) - - - -1.4 LICENSING INFORMATION - -This program is free software; you can redistribute it and/or modify it under -the terms of the GNU General Public License as published by the Free Software -Foundation, version 1. - -This program is distributed in the hope that it will be useful, but WITHOUT -ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or -FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for -more details. - -For a full copy of the GNU General Public License, write to the Free Software -Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - - - -2.0 ADAPTER INSTALLATION and CONFIGURATION -=============================================================================== - -Both the CS8900 and CS8920-based adapters can be configured using parameters -stored in an on-board EEPROM. You must use the DOS-based CS8900/20 Setup -Utility if you want to change the adapter's configuration in EEPROM. - -When loading the driver as a module, you can specify many of the adapter's -configuration parameters on the command-line to override the EEPROM's settings -or for interface configuration when an EEPROM is not used. (CS8920-based -adapters must use an EEPROM.) See Section 3.0 LOADING THE DRIVER AS A MODULE. - -Since the CS8900/20 Setup Utility is a DOS-based application, you must install -and configure the adapter in a DOS-based system using the CS8900/20 Setup -Utility before installation in the target LINUX system. (Not required if -installing a CS8900-based adapter and the default configuration is acceptable.) - - -2.1 CS8900-BASED ADAPTER CONFIGURATION - -CS8900-based adapters shipped from Cirrus Logic have been configured -with the following "default" settings: - - Operation Mode: Memory Mode - IRQ: 10 - Base I/O Address: 300 - Memory Base Address: D0000 - Optimization: DOS Client - Transmission Mode: Half-duplex - BootProm: None - Media Type: Autodetect (3-media cards) or - 10BASE-T (10BASE-T only adapter) - -You should only change the default configuration settings if conflicts with -another adapter exists. To change the adapter's configuration, run the -CS8900/20 Setup Utility. - - -2.2 CS8920-BASED ADAPTER CONFIGURATION - -CS8920-based adapters are shipped from Cirrus Logic configured as Plug -and Play (PnP) enabled. However, since the cs89x0 driver does NOT -support PnP, you must install the CS8920 adapter in a DOS-based PC and -run the CS8900/20 Setup Utility to disable PnP and configure the -adapter before installation in the target Linux system. Failure to do -this will leave the adapter inactive and the driver will be unable to -communicate with the adapter. - - - **************************************************************** - * CS8920-BASED ADAPTERS: * - * * - * CS8920-BASED ADAPTERS ARE PLUG and PLAY ENABLED BY DEFAULT. * - * THE CS89X0 DRIVER DOES NOT SUPPORT PnP. THEREFORE, YOU MUST * - * RUN THE CS8900/20 SETUP UTILITY TO DISABLE PnP SUPPORT AND * - * TO ACTIVATE THE ADAPTER. * - **************************************************************** - - - - -3.0 LOADING THE DRIVER AS A MODULE -=============================================================================== - -If the driver is compiled as a loadable module, you can load the driver module -with the 'modprobe' command. Many of the adapter's configuration parameters can -be specified as command-line arguments to the load command. This facility -provides a means to override the EEPROM's settings or for interface -configuration when an EEPROM is not used. - -Example: - - insmod cs89x0.o io=0x200 irq=0xA media=aui - -This example loads the module and configures the adapter to use an IO port base -address of 200h, interrupt 10, and use the AUI media connection. The following -configuration options are available on the command line: - -* io=### - specify IO address (200h-360h) -* irq=## - specify interrupt level -* use_dma=1 - Enable DMA -* dma=# - specify dma channel (Driver is compiled to support - Rx DMA only) -* dmasize=# (16 or 64) - DMA size 16K or 64K. Default value is set to 16. -* media=rj45 - specify media type - or media=bnc - or media=aui - or media=auto -* duplex=full - specify forced half/full/autonegotiate duplex - or duplex=half - or duplex=auto -* debug=# - debug level (only available if the driver was compiled - for debugging) - -NOTES: - -a) If an EEPROM is present, any specified command-line parameter - will override the corresponding configuration value stored in - EEPROM. - -b) The "io" parameter must be specified on the command-line. - -c) The driver's hardware probe routine is designed to avoid - writing to I/O space until it knows that there is a cs89x0 - card at the written addresses. This could cause problems - with device probing. To avoid this behaviour, add one - to the `io=' module parameter. This doesn't actually change - the I/O address, but it is a flag to tell the driver - to partially initialise the hardware before trying to - identify the card. This could be dangerous if you are - not sure that there is a cs89x0 card at the provided address. - - For example, to scan for an adapter located at IO base 0x300, - specify an IO address of 0x301. - -d) The "duplex=auto" parameter is only supported for the CS8920. - -e) The minimum command-line configuration required if an EEPROM is - not present is: - - io - irq - media type (no autodetect) - -f) The following additional parameters are CS89XX defaults (values - used with no EEPROM or command-line argument). - - * DMA Burst = enabled - * IOCHRDY Enabled = enabled - * UseSA = enabled - * CS8900 defaults to half-duplex if not specified on command-line - * CS8920 defaults to autoneg if not specified on command-line - * Use reset defaults for other config parameters - * dma_mode = 0 - -g) You can use ifconfig to set the adapter's Ethernet address. - -h) Many Linux distributions use the 'modprobe' command to load - modules. This program uses the '/etc/conf.modules' file to - determine configuration information which is passed to a driver - module when it is loaded. All the configuration options which are - described above may be placed within /etc/conf.modules. - - For example: - - > cat /etc/conf.modules - ... - alias eth0 cs89x0 - options cs89x0 io=0x0200 dma=5 use_dma=1 - ... - - In this example we are telling the module system that the - ethernet driver for this machine should use the cs89x0 driver. We - are asking 'modprobe' to pass the 'io', 'dma' and 'use_dma' - arguments to the driver when it is loaded. - -i) Cirrus recommend that the cs89x0 use the ISA DMA channels 5, 6 or - 7. You will probably find that other DMA channels will not work. - -j) The cs89x0 supports DMA for receiving only. DMA mode is - significantly more efficient. Flooding a 400 MHz Celeron machine - with large ping packets consumes 82% of its CPU capacity in non-DMA - mode. With DMA this is reduced to 45%. - -k) If your Linux kernel was compiled with inbuilt plug-and-play - support you will be able to find information about the cs89x0 card - with the command - - cat /proc/isapnp - -l) If during DMA operation you find erratic behavior or network data - corruption you should use your PC's BIOS to slow the EISA bus clock. - -m) If the cs89x0 driver is compiled directly into the kernel - (non-modular) then its I/O address is automatically determined by - ISA bus probing. The IRQ number, media options, etc are determined - from the card's EEPROM. - -n) If the cs89x0 driver is compiled directly into the kernel, DMA - mode may be selected by providing the kernel with a boot option - 'cs89x0_dma=N' where 'N' is the desired DMA channel number (5, 6 or 7). - - Kernel boot options may be provided on the LILO command line: - - LILO boot: linux cs89x0_dma=5 - - or they may be placed in /etc/lilo.conf: - - image=/boot/bzImage-2.3.48 - append="cs89x0_dma=5" - label=linux - root=/dev/hda5 - read-only - - The DMA Rx buffer size is hardwired to 16 kbytes in this mode. - (64k mode is not available). - - -4.0 COMPILING THE DRIVER -=============================================================================== - -The cs89x0 driver can be compiled directly into the kernel or compiled into -a loadable device driver module. - - -4.1 COMPILING THE DRIVER AS A LOADABLE MODULE - -To compile the driver into a loadable module, use the following command -(single command line, without quotes): - -"gcc -D__KERNEL__ -I/usr/src/linux/include -I/usr/src/linux/net/inet -Wall --Wstrict-prototypes -O2 -fomit-frame-pointer -DMODULE -DCONFIG_MODVERSIONS --c cs89x0.c" - -4.2 COMPILING THE DRIVER TO SUPPORT MEMORY MODE - -Support for memory mode was not carried over into the 2.3 series kernels. - -4.3 COMPILING THE DRIVER TO SUPPORT Rx DMA - -The compile-time optionality for DMA was removed in the 2.3 kernel -series. DMA support is now unconditionally part of the driver. It is -enabled by the 'use_dma=1' module option. - - -5.0 TESTING AND TROUBLESHOOTING -=============================================================================== - -5.1 KNOWN DEFECTS and LIMITATIONS - -Refer to the RELEASE.TXT file distributed as part of this archive for a list of -known defects, driver limitations, and work arounds. - - -5.2 TESTING THE ADAPTER - -Once the adapter has been installed and configured, the diagnostic option of -the CS8900/20 Setup Utility can be used to test the functionality of the -adapter and its network connection. Use the diagnostics 'Self Test' option to -test the functionality of the adapter with the hardware configuration you have -assigned. You can use the diagnostics 'Network Test' to test the ability of the -adapter to communicate across the Ethernet with another PC equipped with a -CS8900/20-based adapter card (it must also be running the CS8900/20 Setup -Utility). - - NOTE: The Setup Utility's diagnostics are designed to run in a - DOS-only operating system environment. DO NOT run the diagnostics - from a DOS or command prompt session under Windows 95, Windows NT, - OS/2, or other operating system. - -To run the diagnostics tests on the CS8900/20 adapter: - - 1.) Boot DOS on the PC and start the CS8900/20 Setup Utility. - - 2.) The adapter's current configuration is displayed. Hit the ENTER key to - get to the main menu. - - 4.) Select 'Diagnostics' (ALT-G) from the main menu. - * Select 'Self-Test' to test the adapter's basic functionality. - * Select 'Network Test' to test the network connection and cabling. - - -5.2.1 DIAGNOSTIC SELF-TEST - -The diagnostic self-test checks the adapter's basic functionality as well as -its ability to communicate across the ISA bus based on the system resources -assigned during hardware configuration. The following tests are performed: - - * IO Register Read/Write Test - The IO Register Read/Write test insures that the CS8900/20 can be - accessed in IO mode, and that the IO base address is correct. - - * Shared Memory Test - The Shared Memory test insures the CS8900/20 can be accessed in memory - mode and that the range of memory addresses assigned does not conflict - with other devices in the system. - - * Interrupt Test - The Interrupt test insures there are no conflicts with the assigned IRQ - signal. - - * EEPROM Test - The EEPROM test insures the EEPROM can be read. - - * Chip RAM Test - The Chip RAM test insures the 4K of memory internal to the CS8900/20 is - working properly. - - * Internal Loop-back Test - The Internal Loop Back test insures the adapter's transmitter and - receiver are operating properly. If this test fails, make sure the - adapter's cable is connected to the network (check for LED activity for - example). - - * Boot PROM Test - The Boot PROM test insures the Boot PROM is present, and can be read. - Failure indicates the Boot PROM was not successfully read due to a - hardware problem or due to a conflicts on the Boot PROM address - assignment. (Test only applies if the adapter is configured to use the - Boot PROM option.) - -Failure of a test item indicates a possible system resource conflict with -another device on the ISA bus. In this case, you should use the Manual Setup -option to reconfigure the adapter by selecting a different value for the system -resource that failed. - - -5.2.2 DIAGNOSTIC NETWORK TEST - -The Diagnostic Network Test verifies a working network connection by -transferring data between two CS8900/20 adapters installed in different PCs -on the same network. (Note: the diagnostic network test should not be run -between two nodes across a router.) - -This test requires that each of the two PCs have a CS8900/20-based adapter -installed and have the CS8900/20 Setup Utility running. The first PC is -configured as a Responder and the other PC is configured as an Initiator. -Once the Initiator is started, it sends data frames to the Responder which -returns the frames to the Initiator. - -The total number of frames received and transmitted are displayed on the -Initiator's display, along with a count of the number of frames received and -transmitted OK or in error. The test can be terminated anytime by the user at -either PC. - -To setup the Diagnostic Network Test: - - 1.) Select a PC with a CS8900/20-based adapter and a known working network - connection to act as the Responder. Run the CS8900/20 Setup Utility - and select 'Diagnostics -> Network Test -> Responder' from the main - menu. Hit ENTER to start the Responder. - - 2.) Return to the PC with the CS8900/20-based adapter you want to test and - start the CS8900/20 Setup Utility. - - 3.) From the main menu, Select 'Diagnostic -> Network Test -> Initiator'. - Hit ENTER to start the test. - -You may stop the test on the Initiator at any time while allowing the Responder -to continue running. In this manner, you can move to additional PCs and test -them by starting the Initiator on another PC without having to stop/start the -Responder. - - - -5.3 USING THE ADAPTER'S LEDs - -The 2 and 3-media adapters have two LEDs visible on the back end of the board -located near the 10Base-T connector. - -Link Integrity LED: A "steady" ON of the green LED indicates a valid 10Base-T -connection. (Only applies to 10Base-T. The green LED has no significance for -a 10Base-2 or AUI connection.) - -TX/RX LED: The yellow LED lights briefly each time the adapter transmits or -receives data. (The yellow LED will appear to "flicker" on a typical network.) - - -5.4 RESOLVING I/O CONFLICTS - -An IO conflict occurs when two or more adapter use the same ISA resource (IO -address, memory address or IRQ). You can usually detect an IO conflict in one -of four ways after installing and or configuring the CS8900/20-based adapter: - - 1.) The system does not boot properly (or at all). - - 2.) The driver cannot communicate with the adapter, reporting an "Adapter - not found" error message. - - 3.) You cannot connect to the network or the driver will not load. - - 4.) If you have configured the adapter to run in memory mode but the driver - reports it is using IO mode when loading, this is an indication of a - memory address conflict. - -If an IO conflict occurs, run the CS8900/20 Setup Utility and perform a -diagnostic self-test. Normally, the ISA resource in conflict will fail the -self-test. If so, reconfigure the adapter selecting another choice for the -resource in conflict. Run the diagnostics again to check for further IO -conflicts. - -In some cases, such as when the PC will not boot, it may be necessary to remove -the adapter and reconfigure it by installing it in another PC to run the -CS8900/20 Setup Utility. Once reinstalled in the target system, run the -diagnostics self-test to ensure the new configuration is free of conflicts -before loading the driver again. - -When manually configuring the adapter, keep in mind the typical ISA system -resource usage as indicated in the tables below. - -I/O Address Device IRQ Device ------------ -------- --- -------- - 200-20F Game I/O adapter 3 COM2, Bus Mouse - 230-23F Bus Mouse 4 COM1 - 270-27F LPT3: third parallel port 5 LPT2 - 2F0-2FF COM2: second serial port 6 Floppy Disk controller - 320-32F Fixed disk controller 7 LPT1 - 8 Real-time Clock - 9 EGA/VGA display adapter - 12 Mouse (PS/2) -Memory Address Device 13 Math Coprocessor --------------- --------------------- 14 Hard Disk controller -A000-BFFF EGA Graphics Adapter -A000-C7FF VGA Graphics Adapter -B000-BFFF Mono Graphics Adapter -B800-BFFF Color Graphics Adapter -E000-FFFF AT BIOS - - - - -6.0 TECHNICAL SUPPORT -=============================================================================== - -6.1 CONTACTING CIRRUS LOGIC'S TECHNICAL SUPPORT - -Cirrus Logic's CS89XX Technical Application Support can be reached at: - -Telephone :(800) 888-5016 (from inside U.S. and Canada) - :(512) 442-7555 (from outside the U.S. and Canada) -Fax :(512) 912-3871 -Email :ethernet@crystal.cirrus.com -WWW :http://www.cirrus.com - - -6.2 INFORMATION REQUIRED BEFORE CONTACTING TECHNICAL SUPPORT - -Before contacting Cirrus Logic for technical support, be prepared to provide as -Much of the following information as possible. - -1.) Adapter type (CRD8900, CDB8900, CDB8920, etc.) - -2.) Adapter configuration - - * IO Base, Memory Base, IO or memory mode enabled, IRQ, DMA channel - * Plug and Play enabled/disabled (CS8920-based adapters only) - * Configured for media auto-detect or specific media type (which type). - -3.) PC System's Configuration - - * Plug and Play system (yes/no) - * BIOS (make and version) - * System make and model - * CPU (type and speed) - * System RAM - * SCSI Adapter - -4.) Software - - * CS89XX driver and version - * Your network operating system and version - * Your system's OS version - * Version of all protocol support files - -5.) Any Error Message displayed. - - - -6.3 OBTAINING THE LATEST DRIVER VERSION - -You can obtain the latest CS89XX drivers and support software from Cirrus Logic's -Web site. You can also contact Cirrus Logic's Technical Support (email: -ethernet@crystal.cirrus.com) and request that you be registered for automatic -software-update notification. - -Cirrus Logic maintains a web page at http://www.cirrus.com with the -latest drivers and technical publications. - - -6.4 Current maintainer - -In February 2000 the maintenance of this driver was assumed by Andrew -Morton. - -6.5 Kernel module parameters - -For use in embedded environments with no cs89x0 EEPROM, the kernel boot -parameter `cs89x0_media=' has been implemented. Usage is: - - cs89x0_media=rj45 or - cs89x0_media=aui or - cs89x0_media=bnc - diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 23c4ec9c9125..0b39342e2a1f 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -32,6 +32,7 @@ Contents: amazon/ena aquantia/atlantic chelsio/cxgb + cirrus/cs89x0 .. only:: subproject and html diff --git a/drivers/net/ethernet/cirrus/Kconfig b/drivers/net/ethernet/cirrus/Kconfig index 48f3198381bc..8d845f5ee0c5 100644 --- a/drivers/net/ethernet/cirrus/Kconfig +++ b/drivers/net/ethernet/cirrus/Kconfig @@ -24,7 +24,7 @@ config CS89x0 ---help--- Support for CS89x0 chipset based Ethernet cards. If you have a network (Ethernet) card of this type, say Y and read the file - . + . To compile this driver as a module, choose M here. The module will be called cs89x0. -- cgit v1.2.3 From e1ddedb5cbd6f8ec2f41874bc06e03023fbd9d99 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:40 +0200 Subject: docs: networking: device drivers: convert davicom/dm9000.txt to ReST - add SPDX header; - add a document title; - mark lists as such; - mark tables as such; - mark code blocks and literals as such; - use the right horizontal tag markup; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/davicom/dm9000.rst | 171 +++++++++++++++++++++ .../networking/device_drivers/davicom/dm9000.txt | 167 -------------------- Documentation/networking/device_drivers/index.rst | 1 + 3 files changed, 172 insertions(+), 167 deletions(-) create mode 100644 Documentation/networking/device_drivers/davicom/dm9000.rst delete mode 100644 Documentation/networking/device_drivers/davicom/dm9000.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/davicom/dm9000.rst b/Documentation/networking/device_drivers/davicom/dm9000.rst new file mode 100644 index 000000000000..d5458da01083 --- /dev/null +++ b/Documentation/networking/device_drivers/davicom/dm9000.rst @@ -0,0 +1,171 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +DM9000 Network driver +===================== + +Copyright 2008 Simtec Electronics, + + Ben Dooks + + +Introduction +------------ + +This file describes how to use the DM9000 platform-device based network driver +that is contained in the files drivers/net/dm9000.c and drivers/net/dm9000.h. + +The driver supports three DM9000 variants, the DM9000E which is the first chip +supported as well as the newer DM9000A and DM9000B devices. It is currently +maintained and tested by Ben Dooks, who should be CC: to any patches for this +driver. + + +Defining the platform device +---------------------------- + +The minimum set of resources attached to the platform device are as follows: + + 1) The physical address of the address register + 2) The physical address of the data register + 3) The IRQ line the device's interrupt pin is connected to. + +These resources should be specified in that order, as the ordering of the +two address regions is important (the driver expects these to be address +and then data). + +An example from arch/arm/mach-s3c2410/mach-bast.c is:: + + static struct resource bast_dm9k_resource[] = { + [0] = { + .start = S3C2410_CS5 + BAST_PA_DM9000, + .end = S3C2410_CS5 + BAST_PA_DM9000 + 3, + .flags = IORESOURCE_MEM, + }, + [1] = { + .start = S3C2410_CS5 + BAST_PA_DM9000 + 0x40, + .end = S3C2410_CS5 + BAST_PA_DM9000 + 0x40 + 0x3f, + .flags = IORESOURCE_MEM, + }, + [2] = { + .start = IRQ_DM9000, + .end = IRQ_DM9000, + .flags = IORESOURCE_IRQ | IORESOURCE_IRQ_HIGHLEVEL, + } + }; + + static struct platform_device bast_device_dm9k = { + .name = "dm9000", + .id = 0, + .num_resources = ARRAY_SIZE(bast_dm9k_resource), + .resource = bast_dm9k_resource, + }; + +Note the setting of the IRQ trigger flag in bast_dm9k_resource[2].flags, +as this will generate a warning if it is not present. The trigger from +the flags field will be passed to request_irq() when registering the IRQ +handler to ensure that the IRQ is setup correctly. + +This shows a typical platform device, without the optional configuration +platform data supplied. The next example uses the same resources, but adds +the optional platform data to pass extra configuration data:: + + static struct dm9000_plat_data bast_dm9k_platdata = { + .flags = DM9000_PLATF_16BITONLY, + }; + + static struct platform_device bast_device_dm9k = { + .name = "dm9000", + .id = 0, + .num_resources = ARRAY_SIZE(bast_dm9k_resource), + .resource = bast_dm9k_resource, + .dev = { + .platform_data = &bast_dm9k_platdata, + } + }; + +The platform data is defined in include/linux/dm9000.h and described below. + + +Platform data +------------- + +Extra platform data for the DM9000 can describe the IO bus width to the +device, whether or not an external PHY is attached to the device and +the availability of an external configuration EEPROM. + +The flags for the platform data .flags field are as follows: + +DM9000_PLATF_8BITONLY + + The IO should be done with 8bit operations. + +DM9000_PLATF_16BITONLY + + The IO should be done with 16bit operations. + +DM9000_PLATF_32BITONLY + + The IO should be done with 32bit operations. + +DM9000_PLATF_EXT_PHY + + The chip is connected to an external PHY. + +DM9000_PLATF_NO_EEPROM + + This can be used to signify that the board does not have an + EEPROM, or that the EEPROM should be hidden from the user. + +DM9000_PLATF_SIMPLE_PHY + + Switch to using the simpler PHY polling method which does not + try and read the MII PHY state regularly. This is only available + when using the internal PHY. See the section on link state polling + for more information. + + The config symbol DM9000_FORCE_SIMPLE_PHY_POLL, Kconfig entry + "Force simple NSR based PHY polling" allows this flag to be + forced on at build time. + + +PHY Link state polling +---------------------- + +The driver keeps track of the link state and informs the network core +about link (carrier) availability. This is managed by several methods +depending on the version of the chip and on which PHY is being used. + +For the internal PHY, the original (and currently default) method is +to read the MII state, either when the status changes if we have the +necessary interrupt support in the chip or every two seconds via a +periodic timer. + +To reduce the overhead for the internal PHY, there is now the option +of using the DM9000_FORCE_SIMPLE_PHY_POLL config, or DM9000_PLATF_SIMPLE_PHY +platform data option to read the summary information without the +expensive MII accesses. This method is faster, but does not print +as much information. + +When using an external PHY, the driver currently has to poll the MII +link status as there is no method for getting an interrupt on link change. + + +DM9000A / DM9000B +----------------- + +These chips are functionally similar to the DM9000E and are supported easily +by the same driver. The features are: + + 1) Interrupt on internal PHY state change. This means that the periodic + polling of the PHY status may be disabled on these devices when using + the internal PHY. + + 2) TCP/UDP checksum offloading, which the driver does not currently support. + + +ethtool +------- + +The driver supports the ethtool interface for access to the driver +state information, the PHY state and the EEPROM. diff --git a/Documentation/networking/device_drivers/davicom/dm9000.txt b/Documentation/networking/device_drivers/davicom/dm9000.txt deleted file mode 100644 index 5552e2e575c5..000000000000 --- a/Documentation/networking/device_drivers/davicom/dm9000.txt +++ /dev/null @@ -1,167 +0,0 @@ -DM9000 Network driver -===================== - -Copyright 2008 Simtec Electronics, - Ben Dooks - - -Introduction ------------- - -This file describes how to use the DM9000 platform-device based network driver -that is contained in the files drivers/net/dm9000.c and drivers/net/dm9000.h. - -The driver supports three DM9000 variants, the DM9000E which is the first chip -supported as well as the newer DM9000A and DM9000B devices. It is currently -maintained and tested by Ben Dooks, who should be CC: to any patches for this -driver. - - -Defining the platform device ----------------------------- - -The minimum set of resources attached to the platform device are as follows: - - 1) The physical address of the address register - 2) The physical address of the data register - 3) The IRQ line the device's interrupt pin is connected to. - -These resources should be specified in that order, as the ordering of the -two address regions is important (the driver expects these to be address -and then data). - -An example from arch/arm/mach-s3c2410/mach-bast.c is: - -static struct resource bast_dm9k_resource[] = { - [0] = { - .start = S3C2410_CS5 + BAST_PA_DM9000, - .end = S3C2410_CS5 + BAST_PA_DM9000 + 3, - .flags = IORESOURCE_MEM, - }, - [1] = { - .start = S3C2410_CS5 + BAST_PA_DM9000 + 0x40, - .end = S3C2410_CS5 + BAST_PA_DM9000 + 0x40 + 0x3f, - .flags = IORESOURCE_MEM, - }, - [2] = { - .start = IRQ_DM9000, - .end = IRQ_DM9000, - .flags = IORESOURCE_IRQ | IORESOURCE_IRQ_HIGHLEVEL, - } -}; - -static struct platform_device bast_device_dm9k = { - .name = "dm9000", - .id = 0, - .num_resources = ARRAY_SIZE(bast_dm9k_resource), - .resource = bast_dm9k_resource, -}; - -Note the setting of the IRQ trigger flag in bast_dm9k_resource[2].flags, -as this will generate a warning if it is not present. The trigger from -the flags field will be passed to request_irq() when registering the IRQ -handler to ensure that the IRQ is setup correctly. - -This shows a typical platform device, without the optional configuration -platform data supplied. The next example uses the same resources, but adds -the optional platform data to pass extra configuration data: - -static struct dm9000_plat_data bast_dm9k_platdata = { - .flags = DM9000_PLATF_16BITONLY, -}; - -static struct platform_device bast_device_dm9k = { - .name = "dm9000", - .id = 0, - .num_resources = ARRAY_SIZE(bast_dm9k_resource), - .resource = bast_dm9k_resource, - .dev = { - .platform_data = &bast_dm9k_platdata, - } -}; - -The platform data is defined in include/linux/dm9000.h and described below. - - -Platform data -------------- - -Extra platform data for the DM9000 can describe the IO bus width to the -device, whether or not an external PHY is attached to the device and -the availability of an external configuration EEPROM. - -The flags for the platform data .flags field are as follows: - -DM9000_PLATF_8BITONLY - - The IO should be done with 8bit operations. - -DM9000_PLATF_16BITONLY - - The IO should be done with 16bit operations. - -DM9000_PLATF_32BITONLY - - The IO should be done with 32bit operations. - -DM9000_PLATF_EXT_PHY - - The chip is connected to an external PHY. - -DM9000_PLATF_NO_EEPROM - - This can be used to signify that the board does not have an - EEPROM, or that the EEPROM should be hidden from the user. - -DM9000_PLATF_SIMPLE_PHY - - Switch to using the simpler PHY polling method which does not - try and read the MII PHY state regularly. This is only available - when using the internal PHY. See the section on link state polling - for more information. - - The config symbol DM9000_FORCE_SIMPLE_PHY_POLL, Kconfig entry - "Force simple NSR based PHY polling" allows this flag to be - forced on at build time. - - -PHY Link state polling ----------------------- - -The driver keeps track of the link state and informs the network core -about link (carrier) availability. This is managed by several methods -depending on the version of the chip and on which PHY is being used. - -For the internal PHY, the original (and currently default) method is -to read the MII state, either when the status changes if we have the -necessary interrupt support in the chip or every two seconds via a -periodic timer. - -To reduce the overhead for the internal PHY, there is now the option -of using the DM9000_FORCE_SIMPLE_PHY_POLL config, or DM9000_PLATF_SIMPLE_PHY -platform data option to read the summary information without the -expensive MII accesses. This method is faster, but does not print -as much information. - -When using an external PHY, the driver currently has to poll the MII -link status as there is no method for getting an interrupt on link change. - - -DM9000A / DM9000B ------------------ - -These chips are functionally similar to the DM9000E and are supported easily -by the same driver. The features are: - - 1) Interrupt on internal PHY state change. This means that the periodic - polling of the PHY status may be disabled on these devices when using - the internal PHY. - - 2) TCP/UDP checksum offloading, which the driver does not currently support. - - -ethtool -------- - -The driver supports the ethtool interface for access to the driver -state information, the PHY state and the EEPROM. diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 0b39342e2a1f..e8db57fef2e9 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -33,6 +33,7 @@ Contents: aquantia/atlantic chelsio/cxgb cirrus/cs89x0 + davicom/dm9000 .. only:: subproject and html -- cgit v1.2.3 From b6671d71ca811aed02f136a6cd812a542f88c483 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:41 +0200 Subject: docs: networking: device drivers: convert dec/de4x5.txt to ReST - add SPDX header; - add a document title; - mark code blocks and literals as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/dec/de4x5.rst | 189 +++++++++++++++++++++ .../networking/device_drivers/dec/de4x5.txt | 178 ------------------- Documentation/networking/device_drivers/index.rst | 1 + drivers/net/ethernet/dec/tulip/Kconfig | 2 +- 4 files changed, 191 insertions(+), 179 deletions(-) create mode 100644 Documentation/networking/device_drivers/dec/de4x5.rst delete mode 100644 Documentation/networking/device_drivers/dec/de4x5.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/dec/de4x5.rst b/Documentation/networking/device_drivers/dec/de4x5.rst new file mode 100644 index 000000000000..e03e9c631879 --- /dev/null +++ b/Documentation/networking/device_drivers/dec/de4x5.rst @@ -0,0 +1,189 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +DEC EtherWORKS Ethernet De4x5 cards +=================================== + + Originally, this driver was written for the Digital Equipment + Corporation series of EtherWORKS Ethernet cards: + + - DE425 TP/COAX EISA + - DE434 TP PCI + - DE435 TP/COAX/AUI PCI + - DE450 TP/COAX/AUI PCI + - DE500 10/100 PCI Fasternet + + but it will now attempt to support all cards which conform to the + Digital Semiconductor SROM Specification. The driver currently + recognises the following chips: + + - DC21040 (no SROM) + - DC21041[A] + - DC21140[A] + - DC21142 + - DC21143 + + So far the driver is known to work with the following cards: + + - KINGSTON + - Linksys + - ZNYX342 + - SMC8432 + - SMC9332 (w/new SROM) + - ZNYX31[45] + - ZNYX346 10/100 4 port (can act as a 10/100 bridge!) + + The driver has been tested on a relatively busy network using the DE425, + DE434, DE435 and DE500 cards and benchmarked with 'ttcp': it transferred + 16M of data to a DECstation 5000/200 as follows:: + + TCP UDP + TX RX TX RX + DE425 1030k 997k 1170k 1128k + DE434 1063k 995k 1170k 1125k + DE435 1063k 995k 1170k 1125k + DE500 1063k 998k 1170k 1125k in 10Mb/s mode + + All values are typical (in kBytes/sec) from a sample of 4 for each + measurement. Their error is +/-20k on a quiet (private) network and also + depend on what load the CPU has. + +---------------------------------------------------------------------------- + + The ability to load this driver as a loadable module has been included + and used extensively during the driver development (to save those long + reboot sequences). Loadable module support under PCI and EISA has been + achieved by letting the driver autoprobe as if it were compiled into the + kernel. Do make sure you're not sharing interrupts with anything that + cannot accommodate interrupt sharing! + + To utilise this ability, you have to do 8 things: + + 0) have a copy of the loadable modules code installed on your system. + 1) copy de4x5.c from the /linux/drivers/net directory to your favourite + temporary directory. + 2) for fixed autoprobes (not recommended), edit the source code near + line 5594 to reflect the I/O address you're using, or assign these when + loading by:: + + insmod de4x5 io=0xghh where g = bus number + hh = device number + + .. note:: + + autoprobing for modules is now supported by default. You may just + use:: + + insmod de4x5 + + to load all available boards. For a specific board, still use + the 'io=?' above. + 3) compile de4x5.c, but include -DMODULE in the command line to ensure + that the correct bits are compiled (see end of source code). + 4) if you are wanting to add a new card, goto 5. Otherwise, recompile a + kernel with the de4x5 configuration turned off and reboot. + 5) insmod de4x5 [io=0xghh] + 6) run the net startup bits for your new eth?? interface(s) manually + (usually /etc/rc.inet[12] at boot time). + 7) enjoy! + + To unload a module, turn off the associated interface(s) + 'ifconfig eth?? down' then 'rmmod de4x5'. + + Automedia detection is included so that in principle you can disconnect + from, e.g. TP, reconnect to BNC and things will still work (after a + pause while the driver figures out where its media went). My tests + using ping showed that it appears to work.... + + By default, the driver will now autodetect any DECchip based card. + Should you have a need to restrict the driver to DIGITAL only cards, you + can compile with a DEC_ONLY define, or if loading as a module, use the + 'dec_only=1' parameter. + + I've changed the timing routines to use the kernel timer and scheduling + functions so that the hangs and other assorted problems that occurred + while autosensing the media should be gone. A bonus for the DC21040 + auto media sense algorithm is that it can now use one that is more in + line with the rest (the DC21040 chip doesn't have a hardware timer). + The downside is the 1 'jiffies' (10ms) resolution. + + IEEE 802.3u MII interface code has been added in anticipation that some + products may use it in the future. + + The SMC9332 card has a non-compliant SROM which needs fixing - I have + patched this driver to detect it because the SROM format used complies + to a previous DEC-STD format. + + I have removed the buffer copies needed for receive on Intels. I cannot + remove them for Alphas since the Tulip hardware only does longword + aligned DMA transfers and the Alphas get alignment traps with non + longword aligned data copies (which makes them really slow). No comment. + + I have added SROM decoding routines to make this driver work with any + card that supports the Digital Semiconductor SROM spec. This will help + all cards running the dc2114x series chips in particular. Cards using + the dc2104x chips should run correctly with the basic driver. I'm in + debt to for the testing and feedback that helped get + this feature working. So far we have tested KINGSTON, SMC8432, SMC9332 + (with the latest SROM complying with the SROM spec V3: their first was + broken), ZNYX342 and LinkSys. ZNYX314 (dual 21041 MAC) and ZNYX 315 + (quad 21041 MAC) cards also appear to work despite their incorrectly + wired IRQs. + + I have added a temporary fix for interrupt problems when some SCSI cards + share the same interrupt as the DECchip based cards. The problem occurs + because the SCSI card wants to grab the interrupt as a fast interrupt + (runs the service routine with interrupts turned off) vs. this card + which really needs to run the service routine with interrupts turned on. + This driver will now add the interrupt service routine as a fast + interrupt if it is bounced from the slow interrupt. THIS IS NOT A + RECOMMENDED WAY TO RUN THE DRIVER and has been done for a limited time + until people sort out their compatibility issues and the kernel + interrupt service code is fixed. YOU SHOULD SEPARATE OUT THE FAST + INTERRUPT CARDS FROM THE SLOW INTERRUPT CARDS to ensure that they do not + run on the same interrupt. PCMCIA/CardBus is another can of worms... + + Finally, I think I have really fixed the module loading problem with + more than one DECchip based card. As a side effect, I don't mess with + the device structure any more which means that if more than 1 card in + 2.0.x is installed (4 in 2.1.x), the user will have to edit + linux/drivers/net/Space.c to make room for them. Hence, module loading + is the preferred way to use this driver, since it doesn't have this + limitation. + + Where SROM media detection is used and full duplex is specified in the + SROM, the feature is ignored unless lp->params.fdx is set at compile + time OR during a module load (insmod de4x5 args='eth??:fdx' [see + below]). This is because there is no way to automatically detect full + duplex links except through autonegotiation. When I include the + autonegotiation feature in the SROM autoconf code, this detection will + occur automatically for that case. + + Command line arguments are now allowed, similar to passing arguments + through LILO. This will allow a per adapter board set up of full duplex + and media. The only lexical constraints are: the board name (dev->name) + appears in the list before its parameters. The list of parameters ends + either at the end of the parameter list or with another board name. The + following parameters are allowed: + + ========= =============================================== + fdx for full duplex + autosense to set the media/speed; with the following + sub-parameters: + TP, TP_NW, BNC, AUI, BNC_AUI, 100Mb, 10Mb, AUTO + ========= =============================================== + + Case sensitivity is important for the sub-parameters. They *must* be + upper case. Examples:: + + insmod de4x5 args='eth1:fdx autosense=BNC eth0:autosense=100Mb'. + + For a compiled in driver, in linux/drivers/net/CONFIG, place e.g.:: + + DE4X5_OPTS = -DDE4X5_PARM='"eth0:fdx autosense=AUI eth2:autosense=TP"' + + Yes, I know full duplex isn't permissible on BNC or AUI; they're just + examples. By default, full duplex is turned off and AUTO is the default + autosense setting. In reality, I expect only the full duplex option to + be used. Note the use of single quotes in the two examples above and the + lack of commas to separate items. diff --git a/Documentation/networking/device_drivers/dec/de4x5.txt b/Documentation/networking/device_drivers/dec/de4x5.txt deleted file mode 100644 index 452aac58341d..000000000000 --- a/Documentation/networking/device_drivers/dec/de4x5.txt +++ /dev/null @@ -1,178 +0,0 @@ - Originally, this driver was written for the Digital Equipment - Corporation series of EtherWORKS Ethernet cards: - - DE425 TP/COAX EISA - DE434 TP PCI - DE435 TP/COAX/AUI PCI - DE450 TP/COAX/AUI PCI - DE500 10/100 PCI Fasternet - - but it will now attempt to support all cards which conform to the - Digital Semiconductor SROM Specification. The driver currently - recognises the following chips: - - DC21040 (no SROM) - DC21041[A] - DC21140[A] - DC21142 - DC21143 - - So far the driver is known to work with the following cards: - - KINGSTON - Linksys - ZNYX342 - SMC8432 - SMC9332 (w/new SROM) - ZNYX31[45] - ZNYX346 10/100 4 port (can act as a 10/100 bridge!) - - The driver has been tested on a relatively busy network using the DE425, - DE434, DE435 and DE500 cards and benchmarked with 'ttcp': it transferred - 16M of data to a DECstation 5000/200 as follows: - - TCP UDP - TX RX TX RX - DE425 1030k 997k 1170k 1128k - DE434 1063k 995k 1170k 1125k - DE435 1063k 995k 1170k 1125k - DE500 1063k 998k 1170k 1125k in 10Mb/s mode - - All values are typical (in kBytes/sec) from a sample of 4 for each - measurement. Their error is +/-20k on a quiet (private) network and also - depend on what load the CPU has. - - ========================================================================= - - The ability to load this driver as a loadable module has been included - and used extensively during the driver development (to save those long - reboot sequences). Loadable module support under PCI and EISA has been - achieved by letting the driver autoprobe as if it were compiled into the - kernel. Do make sure you're not sharing interrupts with anything that - cannot accommodate interrupt sharing! - - To utilise this ability, you have to do 8 things: - - 0) have a copy of the loadable modules code installed on your system. - 1) copy de4x5.c from the /linux/drivers/net directory to your favourite - temporary directory. - 2) for fixed autoprobes (not recommended), edit the source code near - line 5594 to reflect the I/O address you're using, or assign these when - loading by: - - insmod de4x5 io=0xghh where g = bus number - hh = device number - - NB: autoprobing for modules is now supported by default. You may just - use: - - insmod de4x5 - - to load all available boards. For a specific board, still use - the 'io=?' above. - 3) compile de4x5.c, but include -DMODULE in the command line to ensure - that the correct bits are compiled (see end of source code). - 4) if you are wanting to add a new card, goto 5. Otherwise, recompile a - kernel with the de4x5 configuration turned off and reboot. - 5) insmod de4x5 [io=0xghh] - 6) run the net startup bits for your new eth?? interface(s) manually - (usually /etc/rc.inet[12] at boot time). - 7) enjoy! - - To unload a module, turn off the associated interface(s) - 'ifconfig eth?? down' then 'rmmod de4x5'. - - Automedia detection is included so that in principle you can disconnect - from, e.g. TP, reconnect to BNC and things will still work (after a - pause while the driver figures out where its media went). My tests - using ping showed that it appears to work.... - - By default, the driver will now autodetect any DECchip based card. - Should you have a need to restrict the driver to DIGITAL only cards, you - can compile with a DEC_ONLY define, or if loading as a module, use the - 'dec_only=1' parameter. - - I've changed the timing routines to use the kernel timer and scheduling - functions so that the hangs and other assorted problems that occurred - while autosensing the media should be gone. A bonus for the DC21040 - auto media sense algorithm is that it can now use one that is more in - line with the rest (the DC21040 chip doesn't have a hardware timer). - The downside is the 1 'jiffies' (10ms) resolution. - - IEEE 802.3u MII interface code has been added in anticipation that some - products may use it in the future. - - The SMC9332 card has a non-compliant SROM which needs fixing - I have - patched this driver to detect it because the SROM format used complies - to a previous DEC-STD format. - - I have removed the buffer copies needed for receive on Intels. I cannot - remove them for Alphas since the Tulip hardware only does longword - aligned DMA transfers and the Alphas get alignment traps with non - longword aligned data copies (which makes them really slow). No comment. - - I have added SROM decoding routines to make this driver work with any - card that supports the Digital Semiconductor SROM spec. This will help - all cards running the dc2114x series chips in particular. Cards using - the dc2104x chips should run correctly with the basic driver. I'm in - debt to for the testing and feedback that helped get - this feature working. So far we have tested KINGSTON, SMC8432, SMC9332 - (with the latest SROM complying with the SROM spec V3: their first was - broken), ZNYX342 and LinkSys. ZNYX314 (dual 21041 MAC) and ZNYX 315 - (quad 21041 MAC) cards also appear to work despite their incorrectly - wired IRQs. - - I have added a temporary fix for interrupt problems when some SCSI cards - share the same interrupt as the DECchip based cards. The problem occurs - because the SCSI card wants to grab the interrupt as a fast interrupt - (runs the service routine with interrupts turned off) vs. this card - which really needs to run the service routine with interrupts turned on. - This driver will now add the interrupt service routine as a fast - interrupt if it is bounced from the slow interrupt. THIS IS NOT A - RECOMMENDED WAY TO RUN THE DRIVER and has been done for a limited time - until people sort out their compatibility issues and the kernel - interrupt service code is fixed. YOU SHOULD SEPARATE OUT THE FAST - INTERRUPT CARDS FROM THE SLOW INTERRUPT CARDS to ensure that they do not - run on the same interrupt. PCMCIA/CardBus is another can of worms... - - Finally, I think I have really fixed the module loading problem with - more than one DECchip based card. As a side effect, I don't mess with - the device structure any more which means that if more than 1 card in - 2.0.x is installed (4 in 2.1.x), the user will have to edit - linux/drivers/net/Space.c to make room for them. Hence, module loading - is the preferred way to use this driver, since it doesn't have this - limitation. - - Where SROM media detection is used and full duplex is specified in the - SROM, the feature is ignored unless lp->params.fdx is set at compile - time OR during a module load (insmod de4x5 args='eth??:fdx' [see - below]). This is because there is no way to automatically detect full - duplex links except through autonegotiation. When I include the - autonegotiation feature in the SROM autoconf code, this detection will - occur automatically for that case. - - Command line arguments are now allowed, similar to passing arguments - through LILO. This will allow a per adapter board set up of full duplex - and media. The only lexical constraints are: the board name (dev->name) - appears in the list before its parameters. The list of parameters ends - either at the end of the parameter list or with another board name. The - following parameters are allowed: - - fdx for full duplex - autosense to set the media/speed; with the following - sub-parameters: - TP, TP_NW, BNC, AUI, BNC_AUI, 100Mb, 10Mb, AUTO - - Case sensitivity is important for the sub-parameters. They *must* be - upper case. Examples: - - insmod de4x5 args='eth1:fdx autosense=BNC eth0:autosense=100Mb'. - - For a compiled in driver, in linux/drivers/net/CONFIG, place e.g. - DE4X5_OPTS = -DDE4X5_PARM='"eth0:fdx autosense=AUI eth2:autosense=TP"' - - Yes, I know full duplex isn't permissible on BNC or AUI; they're just - examples. By default, full duplex is turned off and AUTO is the default - autosense setting. In reality, I expect only the full duplex option to - be used. Note the use of single quotes in the two examples above and the - lack of commas to separate items. diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index e8db57fef2e9..4ad13ffb5800 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -34,6 +34,7 @@ Contents: chelsio/cxgb cirrus/cs89x0 davicom/dm9000 + dec/de4x5 .. only:: subproject and html diff --git a/drivers/net/ethernet/dec/tulip/Kconfig b/drivers/net/ethernet/dec/tulip/Kconfig index 8ce6888ea722..8c4245d94bb2 100644 --- a/drivers/net/ethernet/dec/tulip/Kconfig +++ b/drivers/net/ethernet/dec/tulip/Kconfig @@ -114,7 +114,7 @@ config DE4X5 These include the DE425, DE434, DE435, DE450 and DE500 models. If you have a network card of this type, say Y. More specific information is contained in - . + . To compile this driver as a module, choose M here. The module will be called de4x5. -- cgit v1.2.3 From c981977d3a5ce55c96b1b77f42d0a9df0a79244e Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:42 +0200 Subject: docs: networking: device drivers: convert dec/dmfe.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - comment out text-only TOC from html/pdf output; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/dec/dmfe.rst | 71 ++++++++++++++++++++++ .../networking/device_drivers/dec/dmfe.txt | 66 -------------------- Documentation/networking/device_drivers/index.rst | 1 + MAINTAINERS | 2 +- drivers/net/ethernet/dec/tulip/Kconfig | 2 +- 5 files changed, 74 insertions(+), 68 deletions(-) create mode 100644 Documentation/networking/device_drivers/dec/dmfe.rst delete mode 100644 Documentation/networking/device_drivers/dec/dmfe.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/dec/dmfe.rst b/Documentation/networking/device_drivers/dec/dmfe.rst new file mode 100644 index 000000000000..c4cf809cad84 --- /dev/null +++ b/Documentation/networking/device_drivers/dec/dmfe.rst @@ -0,0 +1,71 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================== +Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux +============================================================== + +Note: This driver doesn't have a maintainer. + + +This program is free software; you can redistribute it and/or +modify it under the terms of the GNU General Public License +as published by the Free Software Foundation; either version 2 +of the License, or (at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + + +This driver provides kernel support for Davicom DM9102(A)/DM9132/DM9801 ethernet cards ( CNET +10/100 ethernet cards uses Davicom chipset too, so this driver supports CNET cards too ).If you +didn't compile this driver as a module, it will automatically load itself on boot and print a +line similar to:: + + dmfe: Davicom DM9xxx net driver, version 1.36.4 (2002-01-17) + +If you compiled this driver as a module, you have to load it on boot.You can load it with command:: + + insmod dmfe + +This way it will autodetect the device mode.This is the suggested way to load the module.Or you can pass +a mode= setting to module while loading, like:: + + insmod dmfe mode=0 # Force 10M Half Duplex + insmod dmfe mode=1 # Force 100M Half Duplex + insmod dmfe mode=4 # Force 10M Full Duplex + insmod dmfe mode=5 # Force 100M Full Duplex + +Next you should configure your network interface with a command similar to:: + + ifconfig eth0 172.22.3.18 + ^^^^^^^^^^^ + Your IP Address + +Then you may have to modify the default routing table with command:: + + route add default eth0 + + +Now your ethernet card should be up and running. + + +TODO: + +- Implement pci_driver::suspend() and pci_driver::resume() power management methods. +- Check on 64 bit boxes. +- Check and fix on big endian boxes. +- Test and make sure PCI latency is now correct for all cases. + + +Authors: + +Sten Wang : Original Author + +Contributors: + +- Marcelo Tosatti +- Alan Cox +- Jeff Garzik +- Vojtech Pavlik diff --git a/Documentation/networking/device_drivers/dec/dmfe.txt b/Documentation/networking/device_drivers/dec/dmfe.txt deleted file mode 100644 index 25320bf19c86..000000000000 --- a/Documentation/networking/device_drivers/dec/dmfe.txt +++ /dev/null @@ -1,66 +0,0 @@ -Note: This driver doesn't have a maintainer. - -Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux. - -This program is free software; you can redistribute it and/or -modify it under the terms of the GNU General Public License -as published by the Free Software Foundation; either version 2 -of the License, or (at your option) any later version. - -This program is distributed in the hope that it will be useful, -but WITHOUT ANY WARRANTY; without even the implied warranty of -MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -GNU General Public License for more details. - - -This driver provides kernel support for Davicom DM9102(A)/DM9132/DM9801 ethernet cards ( CNET -10/100 ethernet cards uses Davicom chipset too, so this driver supports CNET cards too ).If you -didn't compile this driver as a module, it will automatically load itself on boot and print a -line similar to : - - dmfe: Davicom DM9xxx net driver, version 1.36.4 (2002-01-17) - -If you compiled this driver as a module, you have to load it on boot.You can load it with command : - - insmod dmfe - -This way it will autodetect the device mode.This is the suggested way to load the module.Or you can pass -a mode= setting to module while loading, like : - - insmod dmfe mode=0 # Force 10M Half Duplex - insmod dmfe mode=1 # Force 100M Half Duplex - insmod dmfe mode=4 # Force 10M Full Duplex - insmod dmfe mode=5 # Force 100M Full Duplex - -Next you should configure your network interface with a command similar to : - - ifconfig eth0 172.22.3.18 - ^^^^^^^^^^^ - Your IP Address - -Then you may have to modify the default routing table with command : - - route add default eth0 - - -Now your ethernet card should be up and running. - - -TODO: - -Implement pci_driver::suspend() and pci_driver::resume() power management methods. -Check on 64 bit boxes. -Check and fix on big endian boxes. -Test and make sure PCI latency is now correct for all cases. - - -Authors: - -Sten Wang : Original Author - -Contributors: - -Marcelo Tosatti -Alan Cox -Jeff Garzik -Vojtech Pavlik diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 4ad13ffb5800..09728e964ce1 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -35,6 +35,7 @@ Contents: cirrus/cs89x0 davicom/dm9000 dec/de4x5 + dec/dmfe .. only:: subproject and html diff --git a/MAINTAINERS b/MAINTAINERS index b5cfee17635e..f0b18c156176 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4694,7 +4694,7 @@ F: net/ax25/sysctl_net_ax25.c DAVICOM FAST ETHERNET (DMFE) NETWORK DRIVER L: netdev@vger.kernel.org S: Orphan -F: Documentation/networking/device_drivers/dec/dmfe.txt +F: Documentation/networking/device_drivers/dec/dmfe.rst F: drivers/net/ethernet/dec/tulip/dmfe.c DC390/AM53C974 SCSI driver diff --git a/drivers/net/ethernet/dec/tulip/Kconfig b/drivers/net/ethernet/dec/tulip/Kconfig index 8c4245d94bb2..177f36f4b89d 100644 --- a/drivers/net/ethernet/dec/tulip/Kconfig +++ b/drivers/net/ethernet/dec/tulip/Kconfig @@ -138,7 +138,7 @@ config DM9102 This driver is for DM9102(A)/DM9132/DM9801 compatible PCI cards from Davicom (). If you have such a network (Ethernet) card, say Y. Some information is contained in the file - . + . To compile this driver as a module, choose M here. The module will be called dmfe. -- cgit v1.2.3 From ca705e4793f024afb8e86030e08b1e0f16dcc07c Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:43 +0200 Subject: docs: networking: device drivers: convert dlink/dl2k.txt to ReST - add SPDX header; - mark code blocks and literals as such; - mark lists as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/dlink/dl2k.rst | 314 +++++++++++++++++++++ .../networking/device_drivers/dlink/dl2k.txt | 282 ------------------ Documentation/networking/device_drivers/index.rst | 1 + drivers/net/ethernet/dlink/dl2k.c | 2 +- 4 files changed, 316 insertions(+), 283 deletions(-) create mode 100644 Documentation/networking/device_drivers/dlink/dl2k.rst delete mode 100644 Documentation/networking/device_drivers/dlink/dl2k.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/dlink/dl2k.rst b/Documentation/networking/device_drivers/dlink/dl2k.rst new file mode 100644 index 000000000000..ccdb5d0d7460 --- /dev/null +++ b/Documentation/networking/device_drivers/dlink/dl2k.rst @@ -0,0 +1,314 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================================= +D-Link DL2000-based Gigabit Ethernet Adapter Installation +========================================================= + +May 23, 2002 + +.. Contents + + - Compatibility List + - Quick Install + - Compiling the Driver + - Installing the Driver + - Option parameter + - Configuration Script Sample + - Troubleshooting + + +Compatibility List +================== + +Adapter Support: + +- D-Link DGE-550T Gigabit Ethernet Adapter. +- D-Link DGE-550SX Gigabit Ethernet Adapter. +- D-Link DL2000-based Gigabit Ethernet Adapter. + + +The driver support Linux kernel 2.4.7 later. We had tested it +on the environments below. + + . Red Hat v6.2 (update kernel to 2.4.7) + . Red Hat v7.0 (update kernel to 2.4.7) + . Red Hat v7.1 (kernel 2.4.7) + . Red Hat v7.2 (kernel 2.4.7-10) + + +Quick Install +============= +Install linux driver as following command:: + + 1. make all + 2. insmod dl2k.ko + 3. ifconfig eth0 up 10.xxx.xxx.xxx netmask 255.0.0.0 + ^^^^^^^^^^^^^^^\ ^^^^^^^^\ + IP NETMASK + +Now eth0 should active, you can test it by "ping" or get more information by +"ifconfig". If tested ok, continue the next step. + +4. ``cp dl2k.ko /lib/modules/`uname -r`/kernel/drivers/net`` +5. Add the following line to /etc/modprobe.d/dl2k.conf:: + + alias eth0 dl2k + +6. Run ``depmod`` to updated module indexes. +7. Run ``netconfig`` or ``netconf`` to create configuration script ifcfg-eth0 + located at /etc/sysconfig/network-scripts or create it manually. + + [see - Configuration Script Sample] +8. Driver will automatically load and configure at next boot time. + +Compiling the Driver +==================== +In Linux, NIC drivers are most commonly configured as loadable modules. +The approach of building a monolithic kernel has become obsolete. The driver +can be compiled as part of a monolithic kernel, but is strongly discouraged. +The remainder of this section assumes the driver is built as a loadable module. +In the Linux environment, it is a good idea to rebuild the driver from the +source instead of relying on a precompiled version. This approach provides +better reliability since a precompiled driver might depend on libraries or +kernel features that are not present in a given Linux installation. + +The 3 files necessary to build Linux device driver are dl2k.c, dl2k.h and +Makefile. To compile, the Linux installation must include the gcc compiler, +the kernel source, and the kernel headers. The Linux driver supports Linux +Kernels 2.4.7. Copy the files to a directory and enter the following command +to compile and link the driver: + +CD-ROM drive +------------ + +:: + + [root@XXX /] mkdir cdrom + [root@XXX /] mount -r -t iso9660 -o conv=auto /dev/cdrom /cdrom + [root@XXX /] cd root + [root@XXX /root] mkdir dl2k + [root@XXX /root] cd dl2k + [root@XXX dl2k] cp /cdrom/linux/dl2k.tgz /root/dl2k + [root@XXX dl2k] tar xfvz dl2k.tgz + [root@XXX dl2k] make all + +Floppy disc drive +----------------- + +:: + + [root@XXX /] cd root + [root@XXX /root] mkdir dl2k + [root@XXX /root] cd dl2k + [root@XXX dl2k] mcopy a:/linux/dl2k.tgz /root/dl2k + [root@XXX dl2k] tar xfvz dl2k.tgz + [root@XXX dl2k] make all + +Installing the Driver +===================== + +Manual Installation +------------------- + + Once the driver has been compiled, it must be loaded, enabled, and bound + to a protocol stack in order to establish network connectivity. To load a + module enter the command:: + + insmod dl2k.o + + or:: + + insmod dl2k.o ; add parameter + +--------------------------------------------------------- + + example:: + + insmod dl2k.o media=100mbps_hd + + or:: + + insmod dl2k.o media=3 + + or:: + + insmod dl2k.o media=3,2 ; for 2 cards + +--------------------------------------------------------- + + Please reference the list of the command line parameters supported by + the Linux device driver below. + + The insmod command only loads the driver and gives it a name of the form + eth0, eth1, etc. To bring the NIC into an operational state, + it is necessary to issue the following command:: + + ifconfig eth0 up + + Finally, to bind the driver to the active protocol (e.g., TCP/IP with + Linux), enter the following command:: + + ifup eth0 + + Note that this is meaningful only if the system can find a configuration + script that contains the necessary network information. A sample will be + given in the next paragraph. + + The commands to unload a driver are as follows:: + + ifdown eth0 + ifconfig eth0 down + rmmod dl2k.o + + The following are the commands to list the currently loaded modules and + to see the current network configuration:: + + lsmod + ifconfig + + +Automated Installation +---------------------- + This section describes how to install the driver such that it is + automatically loaded and configured at boot time. The following description + is based on a Red Hat 6.0/7.0 distribution, but it can easily be ported to + other distributions as well. + +Red Hat v6.x/v7.x +----------------- + 1. Copy dl2k.o to the network modules directory, typically + /lib/modules/2.x.x-xx/net or /lib/modules/2.x.x/kernel/drivers/net. + 2. Locate the boot module configuration file, most commonly in the + /etc/modprobe.d/ directory. Add the following lines:: + + alias ethx dl2k + options dl2k + + where ethx will be eth0 if the NIC is the only ethernet adapter, eth1 if + one other ethernet adapter is installed, etc. Refer to the table in the + previous section for the list of optional parameters. + 3. Locate the network configuration scripts, normally the + /etc/sysconfig/network-scripts directory, and create a configuration + script named ifcfg-ethx that contains network information. + 4. Note that for most Linux distributions, Red Hat included, a configuration + utility with a graphical user interface is provided to perform steps 2 + and 3 above. + + +Parameter Description +===================== +You can install this driver without any additional parameter. However, if you +are going to have extensive functions then it is necessary to set extra +parameter. Below is a list of the command line parameters supported by the +Linux device +driver. + + +=============================== ============================================== +mtu=packet_size Specifies the maximum packet size. default + is 1500. + +media=media_type Specifies the media type the NIC operates at. + autosense Autosensing active media. + + =========== ========================= + 10mbps_hd 10Mbps half duplex. + 10mbps_fd 10Mbps full duplex. + 100mbps_hd 100Mbps half duplex. + 100mbps_fd 100Mbps full duplex. + 1000mbps_fd 1000Mbps full duplex. + 1000mbps_hd 1000Mbps half duplex. + 0 Autosensing active media. + 1 10Mbps half duplex. + 2 10Mbps full duplex. + 3 100Mbps half duplex. + 4 100Mbps full duplex. + 5 1000Mbps half duplex. + 6 1000Mbps full duplex. + =========== ========================= + + By default, the NIC operates at autosense. + 1000mbps_fd and 1000mbps_hd types are only + available for fiber adapter. + +vlan=n Specifies the VLAN ID. If vlan=0, the + Virtual Local Area Network (VLAN) function is + disable. + +jumbo=[0|1] Specifies the jumbo frame support. If jumbo=1, + the NIC accept jumbo frames. By default, this + function is disabled. + Jumbo frame usually improve the performance + int gigabit. + This feature need jumbo frame compatible + remote. + +rx_coalesce=m Number of rx frame handled each interrupt. +rx_timeout=n Rx DMA wait time for an interrupt. + If set rx_coalesce > 0, hardware only assert + an interrupt for m frames. Hardware won't + assert rx interrupt until m frames received or + reach timeout of n * 640 nano seconds. + Set proper rx_coalesce and rx_timeout can + reduce congestion collapse and overload which + has been a bottleneck for high speed network. + + For example, rx_coalesce=10 rx_timeout=800. + that is, hardware assert only 1 interrupt + for 10 frames received or timeout of 512 us. + +tx_coalesce=n Number of tx frame handled each interrupt. + Set n > 1 can reduce the interrupts + congestion usually lower performance of + high speed network card. Default is 16. + +tx_flow=[1|0] Specifies the Tx flow control. If tx_flow=0, + the Tx flow control disable else driver + autodetect. +rx_flow=[1|0] Specifies the Rx flow control. If rx_flow=0, + the Rx flow control enable else driver + autodetect. +=============================== ============================================== + + +Configuration Script Sample +=========================== +Here is a sample of a simple configuration script:: + + DEVICE=eth0 + USERCTL=no + ONBOOT=yes + POOTPROTO=none + BROADCAST=207.200.5.255 + NETWORK=207.200.5.0 + NETMASK=255.255.255.0 + IPADDR=207.200.5.2 + + +Troubleshooting +=============== +Q1. Source files contain ^ M behind every line. + + Make sure all files are Unix file format (no LF). Try the following + shell command to convert files:: + + cat dl2k.c | col -b > dl2k.tmp + mv dl2k.tmp dl2k.c + + OR:: + + cat dl2k.c | tr -d "\r" > dl2k.tmp + mv dl2k.tmp dl2k.c + +Q2: Could not find header files (``*.h``)? + + To compile the driver, you need kernel header files. After + installing the kernel source, the header files are usually located in + /usr/src/linux/include, which is the default include directory configured + in Makefile. For some distributions, there is a copy of header files in + /usr/src/include/linux and /usr/src/include/asm, that you can change the + INCLUDEDIR in Makefile to /usr/include without installing kernel source. + + Note that RH 7.0 didn't provide correct header files in /usr/include, + including those files will make a wrong version driver. + diff --git a/Documentation/networking/device_drivers/dlink/dl2k.txt b/Documentation/networking/device_drivers/dlink/dl2k.txt deleted file mode 100644 index cba74f7a3abc..000000000000 --- a/Documentation/networking/device_drivers/dlink/dl2k.txt +++ /dev/null @@ -1,282 +0,0 @@ - - D-Link DL2000-based Gigabit Ethernet Adapter Installation - for Linux - May 23, 2002 - -Contents -======== - - Compatibility List - - Quick Install - - Compiling the Driver - - Installing the Driver - - Option parameter - - Configuration Script Sample - - Troubleshooting - - -Compatibility List -================= -Adapter Support: - -D-Link DGE-550T Gigabit Ethernet Adapter. -D-Link DGE-550SX Gigabit Ethernet Adapter. -D-Link DL2000-based Gigabit Ethernet Adapter. - - -The driver support Linux kernel 2.4.7 later. We had tested it -on the environments below. - - . Red Hat v6.2 (update kernel to 2.4.7) - . Red Hat v7.0 (update kernel to 2.4.7) - . Red Hat v7.1 (kernel 2.4.7) - . Red Hat v7.2 (kernel 2.4.7-10) - - -Quick Install -============= -Install linux driver as following command: - -1. make all -2. insmod dl2k.ko -3. ifconfig eth0 up 10.xxx.xxx.xxx netmask 255.0.0.0 - ^^^^^^^^^^^^^^^\ ^^^^^^^^\ - IP NETMASK -Now eth0 should active, you can test it by "ping" or get more information by -"ifconfig". If tested ok, continue the next step. - -4. cp dl2k.ko /lib/modules/`uname -r`/kernel/drivers/net -5. Add the following line to /etc/modprobe.d/dl2k.conf: - alias eth0 dl2k -6. Run depmod to updated module indexes. -7. Run "netconfig" or "netconf" to create configuration script ifcfg-eth0 - located at /etc/sysconfig/network-scripts or create it manually. - [see - Configuration Script Sample] -8. Driver will automatically load and configure at next boot time. - -Compiling the Driver -==================== - In Linux, NIC drivers are most commonly configured as loadable modules. -The approach of building a monolithic kernel has become obsolete. The driver -can be compiled as part of a monolithic kernel, but is strongly discouraged. -The remainder of this section assumes the driver is built as a loadable module. -In the Linux environment, it is a good idea to rebuild the driver from the -source instead of relying on a precompiled version. This approach provides -better reliability since a precompiled driver might depend on libraries or -kernel features that are not present in a given Linux installation. - -The 3 files necessary to build Linux device driver are dl2k.c, dl2k.h and -Makefile. To compile, the Linux installation must include the gcc compiler, -the kernel source, and the kernel headers. The Linux driver supports Linux -Kernels 2.4.7. Copy the files to a directory and enter the following command -to compile and link the driver: - -CD-ROM drive ------------- - -[root@XXX /] mkdir cdrom -[root@XXX /] mount -r -t iso9660 -o conv=auto /dev/cdrom /cdrom -[root@XXX /] cd root -[root@XXX /root] mkdir dl2k -[root@XXX /root] cd dl2k -[root@XXX dl2k] cp /cdrom/linux/dl2k.tgz /root/dl2k -[root@XXX dl2k] tar xfvz dl2k.tgz -[root@XXX dl2k] make all - -Floppy disc drive ------------------ - -[root@XXX /] cd root -[root@XXX /root] mkdir dl2k -[root@XXX /root] cd dl2k -[root@XXX dl2k] mcopy a:/linux/dl2k.tgz /root/dl2k -[root@XXX dl2k] tar xfvz dl2k.tgz -[root@XXX dl2k] make all - -Installing the Driver -===================== - - Manual Installation - ------------------- - Once the driver has been compiled, it must be loaded, enabled, and bound - to a protocol stack in order to establish network connectivity. To load a - module enter the command: - - insmod dl2k.o - - or - - insmod dl2k.o ; add parameter - - =============================================================== - example: insmod dl2k.o media=100mbps_hd - or insmod dl2k.o media=3 - or insmod dl2k.o media=3,2 ; for 2 cards - =============================================================== - - Please reference the list of the command line parameters supported by - the Linux device driver below. - - The insmod command only loads the driver and gives it a name of the form - eth0, eth1, etc. To bring the NIC into an operational state, - it is necessary to issue the following command: - - ifconfig eth0 up - - Finally, to bind the driver to the active protocol (e.g., TCP/IP with - Linux), enter the following command: - - ifup eth0 - - Note that this is meaningful only if the system can find a configuration - script that contains the necessary network information. A sample will be - given in the next paragraph. - - The commands to unload a driver are as follows: - - ifdown eth0 - ifconfig eth0 down - rmmod dl2k.o - - The following are the commands to list the currently loaded modules and - to see the current network configuration. - - lsmod - ifconfig - - - Automated Installation - ---------------------- - This section describes how to install the driver such that it is - automatically loaded and configured at boot time. The following description - is based on a Red Hat 6.0/7.0 distribution, but it can easily be ported to - other distributions as well. - - Red Hat v6.x/v7.x - ----------------- - 1. Copy dl2k.o to the network modules directory, typically - /lib/modules/2.x.x-xx/net or /lib/modules/2.x.x/kernel/drivers/net. - 2. Locate the boot module configuration file, most commonly in the - /etc/modprobe.d/ directory. Add the following lines: - - alias ethx dl2k - options dl2k - - where ethx will be eth0 if the NIC is the only ethernet adapter, eth1 if - one other ethernet adapter is installed, etc. Refer to the table in the - previous section for the list of optional parameters. - 3. Locate the network configuration scripts, normally the - /etc/sysconfig/network-scripts directory, and create a configuration - script named ifcfg-ethx that contains network information. - 4. Note that for most Linux distributions, Red Hat included, a configuration - utility with a graphical user interface is provided to perform steps 2 - and 3 above. - - -Parameter Description -===================== -You can install this driver without any additional parameter. However, if you -are going to have extensive functions then it is necessary to set extra -parameter. Below is a list of the command line parameters supported by the -Linux device -driver. - -mtu=packet_size - Specifies the maximum packet size. default - is 1500. - -media=media_type - Specifies the media type the NIC operates at. - autosense Autosensing active media. - 10mbps_hd 10Mbps half duplex. - 10mbps_fd 10Mbps full duplex. - 100mbps_hd 100Mbps half duplex. - 100mbps_fd 100Mbps full duplex. - 1000mbps_fd 1000Mbps full duplex. - 1000mbps_hd 1000Mbps half duplex. - 0 Autosensing active media. - 1 10Mbps half duplex. - 2 10Mbps full duplex. - 3 100Mbps half duplex. - 4 100Mbps full duplex. - 5 1000Mbps half duplex. - 6 1000Mbps full duplex. - - By default, the NIC operates at autosense. - 1000mbps_fd and 1000mbps_hd types are only - available for fiber adapter. - -vlan=n - Specifies the VLAN ID. If vlan=0, the - Virtual Local Area Network (VLAN) function is - disable. - -jumbo=[0|1] - Specifies the jumbo frame support. If jumbo=1, - the NIC accept jumbo frames. By default, this - function is disabled. - Jumbo frame usually improve the performance - int gigabit. - This feature need jumbo frame compatible - remote. - -rx_coalesce=m - Number of rx frame handled each interrupt. -rx_timeout=n - Rx DMA wait time for an interrupt. - If set rx_coalesce > 0, hardware only assert - an interrupt for m frames. Hardware won't - assert rx interrupt until m frames received or - reach timeout of n * 640 nano seconds. - Set proper rx_coalesce and rx_timeout can - reduce congestion collapse and overload which - has been a bottleneck for high speed network. - - For example, rx_coalesce=10 rx_timeout=800. - that is, hardware assert only 1 interrupt - for 10 frames received or timeout of 512 us. - -tx_coalesce=n - Number of tx frame handled each interrupt. - Set n > 1 can reduce the interrupts - congestion usually lower performance of - high speed network card. Default is 16. - -tx_flow=[1|0] - Specifies the Tx flow control. If tx_flow=0, - the Tx flow control disable else driver - autodetect. -rx_flow=[1|0] - Specifies the Rx flow control. If rx_flow=0, - the Rx flow control enable else driver - autodetect. - - -Configuration Script Sample -=========================== -Here is a sample of a simple configuration script: - -DEVICE=eth0 -USERCTL=no -ONBOOT=yes -POOTPROTO=none -BROADCAST=207.200.5.255 -NETWORK=207.200.5.0 -NETMASK=255.255.255.0 -IPADDR=207.200.5.2 - - -Troubleshooting -=============== -Q1. Source files contain ^ M behind every line. - Make sure all files are Unix file format (no LF). Try the following - shell command to convert files. - - cat dl2k.c | col -b > dl2k.tmp - mv dl2k.tmp dl2k.c - - OR - - cat dl2k.c | tr -d "\r" > dl2k.tmp - mv dl2k.tmp dl2k.c - -Q2: Could not find header files (*.h) ? - To compile the driver, you need kernel header files. After - installing the kernel source, the header files are usually located in - /usr/src/linux/include, which is the default include directory configured - in Makefile. For some distributions, there is a copy of header files in - /usr/src/include/linux and /usr/src/include/asm, that you can change the - INCLUDEDIR in Makefile to /usr/include without installing kernel source. - Note that RH 7.0 didn't provide correct header files in /usr/include, - including those files will make a wrong version driver. - diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 09728e964ce1..e5d1863379cb 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -36,6 +36,7 @@ Contents: davicom/dm9000 dec/de4x5 dec/dmfe + dlink/dl2k .. only:: subproject and html diff --git a/drivers/net/ethernet/dlink/dl2k.c b/drivers/net/ethernet/dlink/dl2k.c index 643090555cc7..5143722c4419 100644 --- a/drivers/net/ethernet/dlink/dl2k.c +++ b/drivers/net/ethernet/dlink/dl2k.c @@ -1869,7 +1869,7 @@ Compile command: gcc -D__KERNEL__ -DMODULE -I/usr/src/linux/include -Wall -Wstrict-prototypes -O2 -c dl2k.c -Read Documentation/networking/device_drivers/dlink/dl2k.txt for details. +Read Documentation/networking/device_drivers/dlink/dl2k.rst for details. */ -- cgit v1.2.3 From 0d0d976f59a57e5536ff01825f0bc8a0dbb0fe6b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:44 +0200 Subject: docs: networking: device drivers: convert freescale/dpaa.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - mark code blocks and literals as such; - use :field: markup; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../networking/device_drivers/freescale/dpaa.rst | 269 +++++++++++++++++++++ .../networking/device_drivers/freescale/dpaa.txt | 260 -------------------- Documentation/networking/device_drivers/index.rst | 1 + 3 files changed, 270 insertions(+), 260 deletions(-) create mode 100644 Documentation/networking/device_drivers/freescale/dpaa.rst delete mode 100644 Documentation/networking/device_drivers/freescale/dpaa.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/freescale/dpaa.rst b/Documentation/networking/device_drivers/freescale/dpaa.rst new file mode 100644 index 000000000000..241c6c6f6e68 --- /dev/null +++ b/Documentation/networking/device_drivers/freescale/dpaa.rst @@ -0,0 +1,269 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================== +The QorIQ DPAA Ethernet Driver +============================== + +Authors: +- Madalin Bucur +- Camelia Groza + +.. Contents + + - DPAA Ethernet Overview + - DPAA Ethernet Supported SoCs + - Configuring DPAA Ethernet in your kernel + - DPAA Ethernet Frame Processing + - DPAA Ethernet Features + - DPAA IRQ Affinity and Receive Side Scaling + - Debugging + +DPAA Ethernet Overview +====================== + +DPAA stands for Data Path Acceleration Architecture and it is a +set of networking acceleration IPs that are available on several +generations of SoCs, both on PowerPC and ARM64. + +The Freescale DPAA architecture consists of a series of hardware blocks +that support Ethernet connectivity. The Ethernet driver depends upon the +following drivers in the Linux kernel: + + - Peripheral Access Memory Unit (PAMU) (* needed only for PPC platforms) + drivers/iommu/fsl_* + - Frame Manager (FMan) + drivers/net/ethernet/freescale/fman + - Queue Manager (QMan), Buffer Manager (BMan) + drivers/soc/fsl/qbman + +A simplified view of the dpaa_eth interfaces mapped to FMan MACs:: + + dpaa_eth /eth0\ ... /ethN\ + driver | | | | + ------------- ---- ----------- ---- ------------- + -Ports / Tx Rx \ ... / Tx Rx \ + FMan | | | | + -MACs | MAC0 | | MACN | + / dtsec0 \ ... / dtsecN \ (or tgec) + / \ / \(or memac) + --------- -------------- --- -------------- --------- + FMan, FMan Port, FMan SP, FMan MURAM drivers + --------------------------------------------------------- + FMan HW blocks: MURAM, MACs, Ports, SP + --------------------------------------------------------- + +The dpaa_eth relation to the QMan, BMan and FMan:: + + ________________________________ + dpaa_eth / eth0 \ + driver / \ + --------- -^- -^- -^- --- --------- + QMan driver / \ / \ / \ \ / | BMan | + |Rx | |Rx | |Tx | |Tx | | driver | + --------- |Dfl| |Err| |Cnf| |FQs| | | + QMan HW |FQ | |FQ | |FQs| | | | | + / \ / \ / \ \ / | | + --------- --- --- --- -v- --------- + | FMan QMI | | + | FMan HW FMan BMI | BMan HW | + ----------------------- -------- + +where the acronyms used above (and in the code) are: + +=============== =========================================================== +DPAA Data Path Acceleration Architecture +FMan DPAA Frame Manager +QMan DPAA Queue Manager +BMan DPAA Buffers Manager +QMI QMan interface in FMan +BMI BMan interface in FMan +FMan SP FMan Storage Profiles +MURAM Multi-user RAM in FMan +FQ QMan Frame Queue +Rx Dfl FQ default reception FQ +Rx Err FQ Rx error frames FQ +Tx Cnf FQ Tx confirmation FQs +Tx FQs transmission frame queues +dtsec datapath three speed Ethernet controller (10/100/1000 Mbps) +tgec ten gigabit Ethernet controller (10 Gbps) +memac multirate Ethernet MAC (10/100/1000/10000) +=============== =========================================================== + +DPAA Ethernet Supported SoCs +============================ + +The DPAA drivers enable the Ethernet controllers present on the following SoCs: + +PPC +- P1023 +- P2041 +- P3041 +- P4080 +- P5020 +- P5040 +- T1023 +- T1024 +- T1040 +- T1042 +- T2080 +- T4240 +- B4860 + +ARM +- LS1043A +- LS1046A + +Configuring DPAA Ethernet in your kernel +======================================== + +To enable the DPAA Ethernet driver, the following Kconfig options are required:: + + # common for arch/arm64 and arch/powerpc platforms + CONFIG_FSL_DPAA=y + CONFIG_FSL_FMAN=y + CONFIG_FSL_DPAA_ETH=y + CONFIG_FSL_XGMAC_MDIO=y + + # for arch/powerpc only + CONFIG_FSL_PAMU=y + + # common options needed for the PHYs used on the RDBs + CONFIG_VITESSE_PHY=y + CONFIG_REALTEK_PHY=y + CONFIG_AQUANTIA_PHY=y + +DPAA Ethernet Frame Processing +============================== + +On Rx, buffers for the incoming frames are retrieved from the buffers found +in the dedicated interface buffer pool. The driver initializes and seeds these +with one page buffers. + +On Tx, all transmitted frames are returned to the driver through Tx +confirmation frame queues. The driver is then responsible for freeing the +buffers. In order to do this properly, a backpointer is added to the buffer +before transmission that points to the skb. When the buffer returns to the +driver on a confirmation FQ, the skb can be correctly consumed. + +DPAA Ethernet Features +====================== + +Currently the DPAA Ethernet driver enables the basic features required for +a Linux Ethernet driver. The support for advanced features will be added +gradually. + +The driver has Rx and Tx checksum offloading for UDP and TCP. Currently the Rx +checksum offload feature is enabled by default and cannot be controlled through +ethtool. Also, rx-flow-hash and rx-hashing was added. The addition of RSS +provides a big performance boost for the forwarding scenarios, allowing +different traffic flows received by one interface to be processed by different +CPUs in parallel. + +The driver has support for multiple prioritized Tx traffic classes. Priorities +range from 0 (lowest) to 3 (highest). These are mapped to HW workqueues with +strict priority levels. Each traffic class contains NR_CPU TX queues. By +default, only one traffic class is enabled and the lowest priority Tx queues +are used. Higher priority traffic classes can be enabled with the mqprio +qdisc. For example, all four traffic classes are enabled on an interface with +the following command. Furthermore, skb priority levels are mapped to traffic +classes as follows: + + * priorities 0 to 3 - traffic class 0 (low priority) + * priorities 4 to 7 - traffic class 1 (medium-low priority) + * priorities 8 to 11 - traffic class 2 (medium-high priority) + * priorities 12 to 15 - traffic class 3 (high priority) + +:: + + tc qdisc add dev root handle 1: \ + mqprio num_tc 4 map 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 hw 1 + +DPAA IRQ Affinity and Receive Side Scaling +========================================== + +Traffic coming on the DPAA Rx queues or on the DPAA Tx confirmation +queues is seen by the CPU as ingress traffic on a certain portal. +The DPAA QMan portal interrupts are affined each to a certain CPU. +The same portal interrupt services all the QMan portal consumers. + +By default the DPAA Ethernet driver enables RSS, making use of the +DPAA FMan Parser and Keygen blocks to distribute traffic on 128 +hardware frame queues using a hash on IP v4/v6 source and destination +and L4 source and destination ports, in present in the received frame. +When RSS is disabled, all traffic received by a certain interface is +received on the default Rx frame queue. The default DPAA Rx frame +queues are configured to put the received traffic into a pool channel +that allows any available CPU portal to dequeue the ingress traffic. +The default frame queues have the HOLDACTIVE option set, ensuring that +traffic bursts from a certain queue are serviced by the same CPU. +This ensures a very low rate of frame reordering. A drawback of this +is that only one CPU at a time can service the traffic received by a +certain interface when RSS is not enabled. + +To implement RSS, the DPAA Ethernet driver allocates an extra set of +128 Rx frame queues that are configured to dedicated channels, in a +round-robin manner. The mapping of the frame queues to CPUs is now +hardcoded, there is no indirection table to move traffic for a certain +FQ (hash result) to another CPU. The ingress traffic arriving on one +of these frame queues will arrive at the same portal and will always +be processed by the same CPU. This ensures intra-flow order preservation +and workload distribution for multiple traffic flows. + +RSS can be turned off for a certain interface using ethtool, i.e.:: + + # ethtool -N fm1-mac9 rx-flow-hash tcp4 "" + +To turn it back on, one needs to set rx-flow-hash for tcp4/6 or udp4/6:: + + # ethtool -N fm1-mac9 rx-flow-hash udp4 sfdn + +There is no independent control for individual protocols, any command +run for one of tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 is +going to control the rx-flow-hashing for all protocols on that interface. + +Besides using the FMan Keygen computed hash for spreading traffic on the +128 Rx FQs, the DPAA Ethernet driver also sets the skb hash value when +the NETIF_F_RXHASH feature is on (active by default). This can be turned +on or off through ethtool, i.e.:: + + # ethtool -K fm1-mac9 rx-hashing off + # ethtool -k fm1-mac9 | grep hash + receive-hashing: off + # ethtool -K fm1-mac9 rx-hashing on + Actual changes: + receive-hashing: on + # ethtool -k fm1-mac9 | grep hash + receive-hashing: on + +Please note that Rx hashing depends upon the rx-flow-hashing being on +for that interface - turning off rx-flow-hashing will also disable the +rx-hashing (without ethtool reporting it as off as that depends on the +NETIF_F_RXHASH feature flag). + +Debugging +========= + +The following statistics are exported for each interface through ethtool: + + - interrupt count per CPU + - Rx packets count per CPU + - Tx packets count per CPU + - Tx confirmed packets count per CPU + - Tx S/G frames count per CPU + - Tx error count per CPU + - Rx error count per CPU + - Rx error count per type + - congestion related statistics: + + - congestion status + - time spent in congestion + - number of time the device entered congestion + - dropped packets count per cause + +The driver also exports the following information in sysfs: + + - the FQ IDs for each FQ type + /sys/devices/platform/soc/.fman/.ethernet/dpaa-ethernet./net/fm-mac/fqids + + - the ID of the buffer pool in use + /sys/devices/platform/soc/.fman/.ethernet/dpaa-ethernet./net/fm-mac/bpids diff --git a/Documentation/networking/device_drivers/freescale/dpaa.txt b/Documentation/networking/device_drivers/freescale/dpaa.txt deleted file mode 100644 index b06601ff9200..000000000000 --- a/Documentation/networking/device_drivers/freescale/dpaa.txt +++ /dev/null @@ -1,260 +0,0 @@ -The QorIQ DPAA Ethernet Driver -============================== - -Authors: -Madalin Bucur -Camelia Groza - -Contents -======== - - - DPAA Ethernet Overview - - DPAA Ethernet Supported SoCs - - Configuring DPAA Ethernet in your kernel - - DPAA Ethernet Frame Processing - - DPAA Ethernet Features - - DPAA IRQ Affinity and Receive Side Scaling - - Debugging - -DPAA Ethernet Overview -====================== - -DPAA stands for Data Path Acceleration Architecture and it is a -set of networking acceleration IPs that are available on several -generations of SoCs, both on PowerPC and ARM64. - -The Freescale DPAA architecture consists of a series of hardware blocks -that support Ethernet connectivity. The Ethernet driver depends upon the -following drivers in the Linux kernel: - - - Peripheral Access Memory Unit (PAMU) (* needed only for PPC platforms) - drivers/iommu/fsl_* - - Frame Manager (FMan) - drivers/net/ethernet/freescale/fman - - Queue Manager (QMan), Buffer Manager (BMan) - drivers/soc/fsl/qbman - -A simplified view of the dpaa_eth interfaces mapped to FMan MACs: - - dpaa_eth /eth0\ ... /ethN\ - driver | | | | - ------------- ---- ----------- ---- ------------- - -Ports / Tx Rx \ ... / Tx Rx \ - FMan | | | | - -MACs | MAC0 | | MACN | - / dtsec0 \ ... / dtsecN \ (or tgec) - / \ / \(or memac) - --------- -------------- --- -------------- --------- - FMan, FMan Port, FMan SP, FMan MURAM drivers - --------------------------------------------------------- - FMan HW blocks: MURAM, MACs, Ports, SP - --------------------------------------------------------- - -The dpaa_eth relation to the QMan, BMan and FMan: - ________________________________ - dpaa_eth / eth0 \ - driver / \ - --------- -^- -^- -^- --- --------- - QMan driver / \ / \ / \ \ / | BMan | - |Rx | |Rx | |Tx | |Tx | | driver | - --------- |Dfl| |Err| |Cnf| |FQs| | | - QMan HW |FQ | |FQ | |FQs| | | | | - / \ / \ / \ \ / | | - --------- --- --- --- -v- --------- - | FMan QMI | | - | FMan HW FMan BMI | BMan HW | - ----------------------- -------- - -where the acronyms used above (and in the code) are: -DPAA = Data Path Acceleration Architecture -FMan = DPAA Frame Manager -QMan = DPAA Queue Manager -BMan = DPAA Buffers Manager -QMI = QMan interface in FMan -BMI = BMan interface in FMan -FMan SP = FMan Storage Profiles -MURAM = Multi-user RAM in FMan -FQ = QMan Frame Queue -Rx Dfl FQ = default reception FQ -Rx Err FQ = Rx error frames FQ -Tx Cnf FQ = Tx confirmation FQs -Tx FQs = transmission frame queues -dtsec = datapath three speed Ethernet controller (10/100/1000 Mbps) -tgec = ten gigabit Ethernet controller (10 Gbps) -memac = multirate Ethernet MAC (10/100/1000/10000) - -DPAA Ethernet Supported SoCs -============================ - -The DPAA drivers enable the Ethernet controllers present on the following SoCs: - -# PPC -P1023 -P2041 -P3041 -P4080 -P5020 -P5040 -T1023 -T1024 -T1040 -T1042 -T2080 -T4240 -B4860 - -# ARM -LS1043A -LS1046A - -Configuring DPAA Ethernet in your kernel -======================================== - -To enable the DPAA Ethernet driver, the following Kconfig options are required: - -# common for arch/arm64 and arch/powerpc platforms -CONFIG_FSL_DPAA=y -CONFIG_FSL_FMAN=y -CONFIG_FSL_DPAA_ETH=y -CONFIG_FSL_XGMAC_MDIO=y - -# for arch/powerpc only -CONFIG_FSL_PAMU=y - -# common options needed for the PHYs used on the RDBs -CONFIG_VITESSE_PHY=y -CONFIG_REALTEK_PHY=y -CONFIG_AQUANTIA_PHY=y - -DPAA Ethernet Frame Processing -============================== - -On Rx, buffers for the incoming frames are retrieved from the buffers found -in the dedicated interface buffer pool. The driver initializes and seeds these -with one page buffers. - -On Tx, all transmitted frames are returned to the driver through Tx -confirmation frame queues. The driver is then responsible for freeing the -buffers. In order to do this properly, a backpointer is added to the buffer -before transmission that points to the skb. When the buffer returns to the -driver on a confirmation FQ, the skb can be correctly consumed. - -DPAA Ethernet Features -====================== - -Currently the DPAA Ethernet driver enables the basic features required for -a Linux Ethernet driver. The support for advanced features will be added -gradually. - -The driver has Rx and Tx checksum offloading for UDP and TCP. Currently the Rx -checksum offload feature is enabled by default and cannot be controlled through -ethtool. Also, rx-flow-hash and rx-hashing was added. The addition of RSS -provides a big performance boost for the forwarding scenarios, allowing -different traffic flows received by one interface to be processed by different -CPUs in parallel. - -The driver has support for multiple prioritized Tx traffic classes. Priorities -range from 0 (lowest) to 3 (highest). These are mapped to HW workqueues with -strict priority levels. Each traffic class contains NR_CPU TX queues. By -default, only one traffic class is enabled and the lowest priority Tx queues -are used. Higher priority traffic classes can be enabled with the mqprio -qdisc. For example, all four traffic classes are enabled on an interface with -the following command. Furthermore, skb priority levels are mapped to traffic -classes as follows: - - * priorities 0 to 3 - traffic class 0 (low priority) - * priorities 4 to 7 - traffic class 1 (medium-low priority) - * priorities 8 to 11 - traffic class 2 (medium-high priority) - * priorities 12 to 15 - traffic class 3 (high priority) - -tc qdisc add dev root handle 1: \ - mqprio num_tc 4 map 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 hw 1 - -DPAA IRQ Affinity and Receive Side Scaling -========================================== - -Traffic coming on the DPAA Rx queues or on the DPAA Tx confirmation -queues is seen by the CPU as ingress traffic on a certain portal. -The DPAA QMan portal interrupts are affined each to a certain CPU. -The same portal interrupt services all the QMan portal consumers. - -By default the DPAA Ethernet driver enables RSS, making use of the -DPAA FMan Parser and Keygen blocks to distribute traffic on 128 -hardware frame queues using a hash on IP v4/v6 source and destination -and L4 source and destination ports, in present in the received frame. -When RSS is disabled, all traffic received by a certain interface is -received on the default Rx frame queue. The default DPAA Rx frame -queues are configured to put the received traffic into a pool channel -that allows any available CPU portal to dequeue the ingress traffic. -The default frame queues have the HOLDACTIVE option set, ensuring that -traffic bursts from a certain queue are serviced by the same CPU. -This ensures a very low rate of frame reordering. A drawback of this -is that only one CPU at a time can service the traffic received by a -certain interface when RSS is not enabled. - -To implement RSS, the DPAA Ethernet driver allocates an extra set of -128 Rx frame queues that are configured to dedicated channels, in a -round-robin manner. The mapping of the frame queues to CPUs is now -hardcoded, there is no indirection table to move traffic for a certain -FQ (hash result) to another CPU. The ingress traffic arriving on one -of these frame queues will arrive at the same portal and will always -be processed by the same CPU. This ensures intra-flow order preservation -and workload distribution for multiple traffic flows. - -RSS can be turned off for a certain interface using ethtool, i.e. - - # ethtool -N fm1-mac9 rx-flow-hash tcp4 "" - -To turn it back on, one needs to set rx-flow-hash for tcp4/6 or udp4/6: - - # ethtool -N fm1-mac9 rx-flow-hash udp4 sfdn - -There is no independent control for individual protocols, any command -run for one of tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 is -going to control the rx-flow-hashing for all protocols on that interface. - -Besides using the FMan Keygen computed hash for spreading traffic on the -128 Rx FQs, the DPAA Ethernet driver also sets the skb hash value when -the NETIF_F_RXHASH feature is on (active by default). This can be turned -on or off through ethtool, i.e.: - - # ethtool -K fm1-mac9 rx-hashing off - # ethtool -k fm1-mac9 | grep hash - receive-hashing: off - # ethtool -K fm1-mac9 rx-hashing on - Actual changes: - receive-hashing: on - # ethtool -k fm1-mac9 | grep hash - receive-hashing: on - -Please note that Rx hashing depends upon the rx-flow-hashing being on -for that interface - turning off rx-flow-hashing will also disable the -rx-hashing (without ethtool reporting it as off as that depends on the -NETIF_F_RXHASH feature flag). - -Debugging -========= - -The following statistics are exported for each interface through ethtool: - - - interrupt count per CPU - - Rx packets count per CPU - - Tx packets count per CPU - - Tx confirmed packets count per CPU - - Tx S/G frames count per CPU - - Tx error count per CPU - - Rx error count per CPU - - Rx error count per type - - congestion related statistics: - - congestion status - - time spent in congestion - - number of time the device entered congestion - - dropped packets count per cause - -The driver also exports the following information in sysfs: - - - the FQ IDs for each FQ type - /sys/devices/platform/soc/.fman/.ethernet/dpaa-ethernet./net/fm-mac/fqids - - - the ID of the buffer pool in use - /sys/devices/platform/soc/.fman/.ethernet/dpaa-ethernet./net/fm-mac/bpids diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index e5d1863379cb..7e59ee43c030 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -37,6 +37,7 @@ Contents: dec/de4x5 dec/dmfe dlink/dl2k + freescale/dpaa .. only:: subproject and html -- cgit v1.2.3 From dc67e91e7f7b2245a3a341e62217cb5e7163d60b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:45 +0200 Subject: docs: networking: device drivers: convert freescale/gianfar.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - use :field: markup; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- .../device_drivers/freescale/gianfar.rst | 51 ++++++++++++++++++++++ .../device_drivers/freescale/gianfar.txt | 42 ------------------ Documentation/networking/device_drivers/index.rst | 1 + 3 files changed, 52 insertions(+), 42 deletions(-) create mode 100644 Documentation/networking/device_drivers/freescale/gianfar.rst delete mode 100644 Documentation/networking/device_drivers/freescale/gianfar.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/freescale/gianfar.rst b/Documentation/networking/device_drivers/freescale/gianfar.rst new file mode 100644 index 000000000000..9c4a91d3824b --- /dev/null +++ b/Documentation/networking/device_drivers/freescale/gianfar.rst @@ -0,0 +1,51 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +The Gianfar Ethernet Driver +=========================== + +:Author: Andy Fleming +:Updated: 2005-07-28 + + +Checksum Offloading +=================== + +The eTSEC controller (first included in parts from late 2005 like +the 8548) has the ability to perform TCP, UDP, and IP checksums +in hardware. The Linux kernel only offloads the TCP and UDP +checksums (and always performs the pseudo header checksums), so +the driver only supports checksumming for TCP/IP and UDP/IP +packets. Use ethtool to enable or disable this feature for RX +and TX. + +VLAN +==== + +In order to use VLAN, please consult Linux documentation on +configuring VLANs. The gianfar driver supports hardware insertion and +extraction of VLAN headers, but not filtering. Filtering will be +done by the kernel. + +Multicasting +============ + +The gianfar driver supports using the group hash table on the +TSEC (and the extended hash table on the eTSEC) for multicast +filtering. On the eTSEC, the exact-match MAC registers are used +before the hash tables. See Linux documentation on how to join +multicast groups. + +Padding +======= + +The gianfar driver supports padding received frames with 2 bytes +to align the IP header to a 16-byte boundary, when supported by +hardware. + +Ethtool +======= + +The gianfar driver supports the use of ethtool for many +configuration options. You must run ethtool only on currently +open interfaces. See ethtool documentation for details. diff --git a/Documentation/networking/device_drivers/freescale/gianfar.txt b/Documentation/networking/device_drivers/freescale/gianfar.txt deleted file mode 100644 index ba1daea7f2e4..000000000000 --- a/Documentation/networking/device_drivers/freescale/gianfar.txt +++ /dev/null @@ -1,42 +0,0 @@ -The Gianfar Ethernet Driver - -Author: Andy Fleming -Updated: 2005-07-28 - - -CHECKSUM OFFLOADING - -The eTSEC controller (first included in parts from late 2005 like -the 8548) has the ability to perform TCP, UDP, and IP checksums -in hardware. The Linux kernel only offloads the TCP and UDP -checksums (and always performs the pseudo header checksums), so -the driver only supports checksumming for TCP/IP and UDP/IP -packets. Use ethtool to enable or disable this feature for RX -and TX. - -VLAN - -In order to use VLAN, please consult Linux documentation on -configuring VLANs. The gianfar driver supports hardware insertion and -extraction of VLAN headers, but not filtering. Filtering will be -done by the kernel. - -MULTICASTING - -The gianfar driver supports using the group hash table on the -TSEC (and the extended hash table on the eTSEC) for multicast -filtering. On the eTSEC, the exact-match MAC registers are used -before the hash tables. See Linux documentation on how to join -multicast groups. - -PADDING - -The gianfar driver supports padding received frames with 2 bytes -to align the IP header to a 16-byte boundary, when supported by -hardware. - -ETHTOOL - -The gianfar driver supports the use of ethtool for many -configuration options. You must run ethtool only on currently -open interfaces. See ethtool documentation for details. diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 7e59ee43c030..cec3415ee459 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -38,6 +38,7 @@ Contents: dec/dmfe dlink/dl2k freescale/dpaa + freescale/gianfar .. only:: subproject and html -- cgit v1.2.3 From cf7eba49b2b160f98106b33ca12039b05d812140 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 1 May 2020 16:44:46 +0200 Subject: docs: networking: device drivers: convert intel/ipw2100.txt to ReST - add SPDX header; - adjust titles and chapters, adding proper markups; - comment out text-only TOC from html/pdf output; - use copyright symbol; - use :field: markup; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/device_drivers/index.rst | 1 + .../networking/device_drivers/intel/ipw2100.rst | 323 +++++++++++++++++++++ .../networking/device_drivers/intel/ipw2100.txt | 293 ------------------- MAINTAINERS | 2 +- drivers/net/wireless/intel/ipw2x00/Kconfig | 2 +- drivers/net/wireless/intel/ipw2x00/ipw2100.c | 2 +- 6 files changed, 327 insertions(+), 296 deletions(-) create mode 100644 Documentation/networking/device_drivers/intel/ipw2100.rst delete mode 100644 Documentation/networking/device_drivers/intel/ipw2100.txt (limited to 'Documentation') diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index cec3415ee459..54ed10f3d1a7 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -39,6 +39,7 @@ Contents: dlink/dl2k freescale/dpaa freescale/gianfar + intel/ipw2100 .. only:: subproject and html diff --git a/Documentation/networking/device_drivers/intel/ipw2100.rst b/Documentation/networking/device_drivers/intel/ipw2100.rst new file mode 100644 index 000000000000..d54ad522f937 --- /dev/null +++ b/Documentation/networking/device_drivers/intel/ipw2100.rst @@ -0,0 +1,323 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +=========================================== +Intel(R) PRO/Wireless 2100 Driver for Linux +=========================================== + +Support for: + +- Intel(R) PRO/Wireless 2100 Network Connection + +Copyright |copy| 2003-2006, Intel Corporation + +README.ipw2100 + +:Version: git-1.1.5 +:Date: January 25, 2006 + +.. Index + + 0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER + 1. Introduction + 2. Release git-1.1.5 Current Features + 3. Command Line Parameters + 4. Sysfs Helper Files + 5. Radio Kill Switch + 6. Dynamic Firmware + 7. Power Management + 8. Support + 9. License + + +0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER +================================================= + +Important Notice FOR ALL USERS OR DISTRIBUTORS!!!! + +Intel wireless LAN adapters are engineered, manufactured, tested, and +quality checked to ensure that they meet all necessary local and +governmental regulatory agency requirements for the regions that they +are designated and/or marked to ship into. Since wireless LANs are +generally unlicensed devices that share spectrum with radars, +satellites, and other licensed and unlicensed devices, it is sometimes +necessary to dynamically detect, avoid, and limit usage to avoid +interference with these devices. In many instances Intel is required to +provide test data to prove regional and local compliance to regional and +governmental regulations before certification or approval to use the +product is granted. Intel's wireless LAN's EEPROM, firmware, and +software driver are designed to carefully control parameters that affect +radio operation and to ensure electromagnetic compliance (EMC). These +parameters include, without limitation, RF power, spectrum usage, +channel scanning, and human exposure. + +For these reasons Intel cannot permit any manipulation by third parties +of the software provided in binary format with the wireless WLAN +adapters (e.g., the EEPROM and firmware). Furthermore, if you use any +patches, utilities, or code with the Intel wireless LAN adapters that +have been manipulated by an unauthorized party (i.e., patches, +utilities, or code (including open source code modifications) which have +not been validated by Intel), (i) you will be solely responsible for +ensuring the regulatory compliance of the products, (ii) Intel will bear +no liability, under any theory of liability for any issues associated +with the modified products, including without limitation, claims under +the warranty and/or issues arising from regulatory non-compliance, and +(iii) Intel will not provide or be required to assist in providing +support to any third parties for such modified products. + +Note: Many regulatory agencies consider Wireless LAN adapters to be +modules, and accordingly, condition system-level regulatory approval +upon receipt and review of test data documenting that the antennas and +system configuration do not cause the EMC and radio operation to be +non-compliant. + +The drivers available for download from SourceForge are provided as a +part of a development project. Conformance to local regulatory +requirements is the responsibility of the individual developer. As +such, if you are interested in deploying or shipping a driver as part of +solution intended to be used for purposes other than development, please +obtain a tested driver from Intel Customer Support at: + +http://www.intel.com/support/wireless/sb/CS-006408.htm + +1. Introduction +=============== + +This document provides a brief overview of the features supported by the +IPW2100 driver project. The main project website, where the latest +development version of the driver can be found, is: + + http://ipw2100.sourceforge.net + +There you can find the not only the latest releases, but also information about +potential fixes and patches, as well as links to the development mailing list +for the driver project. + + +2. Release git-1.1.5 Current Supported Features +=============================================== + +- Managed (BSS) and Ad-Hoc (IBSS) +- WEP (shared key and open) +- Wireless Tools support +- 802.1x (tested with XSupplicant 1.0.1) + +Enabled (but not supported) features: +- Monitor/RFMon mode +- WPA/WPA2 + +The distinction between officially supported and enabled is a reflection +on the amount of validation and interoperability testing that has been +performed on a given feature. + + +3. Command Line Parameters +========================== + +If the driver is built as a module, the following optional parameters are used +by entering them on the command line with the modprobe command using this +syntax:: + + modprobe ipw2100 [